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1  Summary  of  accomplishments 

Our  efforts  in  these  twelve  quarters  resulted  in  four  patents  (two  pending,  two  provisional  appli¬ 
cations),  seven  journal  papers,  seven  conference  papers  and  numerous  professional  and  scholarly 
contributions,  details  of  which  are  presented  below. 


•  Patents  Pending 

—  Y.  Altunbasak,  B.  Gunturk,  and  R.  M.  Mersereau,  “Resolution  enhancement  and  artifact 
reduction  for  MPEG  video,”  filed  with  the  Patent  Office  on  April  26,  2002. 

-  Y.  Altunbasak  and  A.  Patti,  “A  fast  and  robust  model-based  motion  estimation  strat¬ 
egy,”  filed  with  the  Patent  Office,  Sept.  2002. 

•  Provisional  Patent  Applications 

—  Y.  Altunbasak  and  Y.  C.  Lee,  “Multi-frame  error  concealment  methods  for  transform- 
coded  video,”  (GTRC  ID  2509)  filed  with  the  USPTO  on  June  27,  2001 

-  B.  Gunturk,  A.  U.  Batur,  Y.  Altunbasak,  M.  H.  Hayes  III,  and  R.  M.  Mersereau,  Super¬ 
resolution  for  face  recognition,  (GTRC  ID  2653)  filed  with  the  USPTO  on  May  3,  2002. 

•  Journal  Papers 

-  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “A  DCT-domain  bayesian  resolution 
enhancement  framework  for  MPEG  video,”  IEEE  Transactions  on  Image  Processing, 
vol.  13,  no.  1,  pp.  33-  43,  January  2004. 

-  B.  Gunturk,  J.  Glotzbach,  Y.  Altunbasak,  R.  W.  Schafer,  and  R.  M.  Mersereau,  “De- 
mosaicking:  Color  filter  array  interpolation  in  single  chip  digital  cameras,”  IEEE  Signal 
Processing  Magazine  (Special  Issue  on  Color  Image  Processing),  September  2004. 

—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “Multiframe  resolution-enhancement 
methods  for  compressed  video,”  IEEE  Signal  Processing  Letters,  vol.  9,  no.  6,  pp. 
170-174,  June  2002. 
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—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “A  multi-frame  blocking  artifact  re¬ 
duction  method  for  transform  coded  video,”  IEEE  Transactions  on  Circuits  and  Systems 
for  Video  Technology,  vol.  12,  no.  4,  pp.  269-276,  April  2002. 

—  B.  K.  Gunturk,  A.  U.  Batur,  Y.  Altunbasak,  M.  H.  Hayes  III,  and  R.  M.  Mersereau, 
“Eigenface-domain  super-resolution  for  face  recognition,”  IEEE  Transactions  on  Image 
Processing,  May  2003. 

—  Y.  Altunbasak,  A.  J.  Patti,  and  R.  M.  Mersereau,  Super-resolution  still  and  video  re¬ 
construction  from  MPEG  coded  video,  IEEE  Transactions  on  Circuits  and  Systems  for 
Video  Technology,  vol.  12,  no.  4,  pp.  217-227,  April  2002. 

—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  Color  plane  interpolation  using  al¬ 
ternating  projections,  IEEE  Transactions  on  Image  Processing,  vol.  11,  no.  9,  pp. 
997-1013,  September  2002. 

•  Conference  Papers 

—  B.  K.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “Multi-frame  information  fusion 
for  gray-scale  and  spatial  enhancement  of  images,”  in  Proc.  IEEE  Int.  Conf.  Image 
Processing,  Barcelona,  Spain,  Sep.  2003. 

—  B.  Gunturk,  A.  Batur,  Y.  Altunbasak,  M.  H.  Hayes  III,  and  R.  M.  Mersereau,  “Eigenface- 
based  super-resolution  for  face  recognition,”  IEEE  Int.  Conf.  on  Image  Processing,  vol. 
2,  pp.  845-848,  Rochester,  NY,  September  2002. 

—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “A  multi-frame  blocking  artifact 
reduction  method  for  transform  coded  video,”  IEEE  Int.  Conf.  Acoust.  Speech  Sign. 
Proc.,  vol.  3,  pp.  1789-1792,  Salt  Lake  City,  UT,  May  2001. 

—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “A  bayesian  resolution  enhancement 
framework  for  transform  coded  video,”  Proc.  IEEE  Int.  Conf.  Image  Proc.,  vol.  2,  pp. 
41-44,  Thessaloniki,  Greece,  Oct.  2001. 

—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  “Gray-scale  resolution  enhancement,” 
IEEE  Workshop  on  Multimedia  Signal  Processing,  pp.  155-160,  Cannes,  Prance,  Oct. 
2001. 

—  B.  Gunturk,  Y.  Altunbasak,  and  R.  M.  Mersereau,  Color  plane  interpolation  using  al¬ 
ternating  projections,  IEEE  Int.  Conf.  on  Acoustics  Speech  and  Signal  Processing,  vol. 
4,  pp.  3333-3336,  Orlando,  FL,  May  2002. 

—  Toygar  Akgun,  Yucel  Altunbasak  and  R.  M.  Mersereau,  “Superresolution  reconstruction 
of  hyperspectral  images,”  in  Proc.  IEEE  Int.  Conf.  on  Acoustics  Speech  and  Signal 
Processing  (ICASSP),  Montreal,  Canada,  May  2004 

•  Main  Accomplishments 

—  Multi-frame  Blocking  Artifact  Reduction  Method  for  Transform  Coded  Video  and  its 
Relation  to  Superresolution  (Section  4.1  ) 

—  DCT-Domain  Bayesian  Superresolution  Reconstruction  for  MPEG  Video  (Section  4.2  ) 

—  Fast  Motion  Estimation  Using  Low-bit-resolution  Images  (Section  4.3  ) 

—  Multi-Frame  Gray-Scale  Resolution  Enhancement  (Section  4.4  ) 

—  Face  Recognition  from  Video  (Section  4.5  ) 
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—  Effects  of  Camera  Response  Function  and  Illumination  Changes  in  Multi-Frame  Image 
Reconstruction  (Section  4.6  ) 

—  Multi-Frame  Information  Fusion  for  Gray-Scale  and  Spatial  Enhancement  of  Images 
(Section  4.7  ) 

—  Super-Resolution  Reconstruction  of  Hyper-Spectral  Images  (Section  4.8  ) 


2  Professional  and  Scholarly  Contributions 

•  Leadership  Activities 

—  Technical  program  chair,  IEEE  Int.  Conf.  on  Image  Processing  (ICIP’06),  Atlanta,  GA, 
2006. 

—  Advanced  Signal  Processing  for  Communications”  Symposia  co-chair,  IEEE  Interna¬ 
tional  Conference  on  Communications  (ICC’03),  May  11-15,  2003 

-  Multimedia  Networking  Technical  Track  Chair,  IEEE  International  Conference  on  Mul¬ 
timedia  and  Expo  (ICME’04),  June  27-30,  2004 

—  Multimedia  Networking  Technical  Track  Chair,  IEEE  International  Conference  on  Mul¬ 
timedia  and  Expo  (ICME’03),  July  6-9,  2003 

—  Panel  sessions  chair,  International  Conference  on  Information  Technology:  Research  and 
Education  (ITRE’03),  August  10-13,  2003 

-  Session  chair,  Video  Transcoding,  IEEE  Int.  Conf.  on  Image  Processing  (ICIP’03), 
September  15,  2003 

—  Session  chair,  Wireless  Multimedia  I,  IEEE  Int.  Conf.  Multimedia  and  Expo  (ICME), 
Baltimore,  MD,  July  7,  2003 

—  Session  chair,  Multimedia  I,  IEEE  Int.  Conf.  on  Communications  (ICC’03),  May  12, 
2003 

-  Session  chair,  Wireless  networking,  IEEE  Int.  Conf.  on  Communications  (ICC’03),  May 
14,2003 

—  Session  chair,  Error  Concealment,  IEEE  Int.  Conf.  on  Image  Processing  (ICIP’02), 
Sept.  22-25,  2002 

-  Session  chair,  Image/Video  Coding,  IEEE  Int.  Conf.  Acoust.  Speech  Sign.  Proc. 
(ICASSP’02),  May  10-14,  2002 

—  Session  chair,  Digital  Video  Processing,  32nd  Asilomar  Conference  on  Signals,  Systems, 
and  Computers,  Nov.  1-4,  1998 

•  Editorial  Activities 

—  Associate  Editor,  IEEE  Transactions  on  Image  Processing,  5/26/2002-6/1/2005 

—  Associate  Editor,  IEEE  Transactions  on  Signal  Processing,  7/31/2003-7/31/2005 

—  Area  Editor,  Signal  Processing:  Image  Communications,  1/1/2001-1/1/2004 

—  Associate  Editor,  Circuits,  Systems,  and  Signal  Processing,  6/21/2000-12/31/2002 
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-  Guest  Editor,  EURASIP  Image  Communications  Special  Issue  on  ’’Recent  Advances  in 
Wireless  Video” 

-  Panelist  and  proposal  reviewer  for  NSF,  June  2001,  Feb.  2003,  and  May  2003 

—  Panelist  and  proposal  reviewer  for  ARO,  March  2001 

•  Membership  Activities 

-  Advisory  Board,  Department  of  Electrical  Engineering,  Bilkent  University,  Ankara, 
Turkey 

—  Vice-president  -  North  America,  IEEE  Communications  Society  Multimedia  Communi¬ 
cations  Technical  Society,  1/2004-1/2006 

—  Elected  member,  IEEE  Signal  Processing  Society  Image  and  Multi-dimensional  Signal 
Processing  (IMDSP)  Technical  Committee,  5/2002-5/2008 

—  Senior  Member,  IEEE,  2002 

-  Technical  program  committee  member,  ICME’04,  June  2004,  Taipei,  Taiwan 

-  Technical  program  committee  member,  ITRE’04,  June  2004,  London,  England 

-  Technical  program  committee  member,  ICASSP’04,  Montreal,  Canada,  May  2004 

-  Technical  program  committee  member,  ICC’04,  Paris,  France,  June  2004 

—  Technical  program  committee  member,  ICIP’03,  Barcelona,  Spain,  September  2003 

-  Technical  program  committee  member,  ICME’03,  July  2003,  Baltimore,  MD 

-  Technical  program  committee  member,  ICASSP’03,  Hong  Kong,  China,  April  2003 

—  Technical  program  committee  member,  Wireless  Personal  Multimedia  Communications 
(WPMC’02) 

-  Technical  program  committee  member,  IEEE  Int.  Conf.  on  Image  Processing  (ICIP’02), 
Rochester,  NY,  September,  2002 

-  Technical  program  committee  member,  ICIP’01,  Thessaloniki,  Greece,  October  2001 

—  Technical  program  committee  member,  IEEE  International  Symposium  on  Circuits  and 
Systems  (ISCAS’02),  Scottsdale,  Arizona,  May  2002 

-  Technical  program  committee  member,  ICASSP’01,  Salt  Lake  City,  Utah,  May  2001 


3  Honors  and  Awards 

•  Received  NSF  CAREER  award  (2002) 

•  Received  ’’Outstanding  Junior  Faculty  Award”  at  the  School  of  Electrical  and  Computer 
Engineering,  Georgia  Tech  (2003) 

•  Co-author  for  a  conference  paper  that  received  the  ’’best  student  paper  award”  at  ICIP  2003 
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4  Main  Accomplishments  and  Results 

4.1  Multi-frame  Blocking  Artifact  Reduction  Method  for  Transform  Coded 
Video 

Transform  coding  is  a  popular  and  effective  compression  method  for  both  still  images  and  video 
sequences,  as  is  evident  from  its  widespread  use  in  international  media  coding  standards  such 
as  MPEG,  H.263  and  JPEG.  The  motion-compensated  image  (or  the  image  itself)  is  divided  into 
blocks  and  each  block  is  independently  transformed  by  a  2-D  orthogonal  transform  to  achieve  energy 
compaction.  The  most  commonly  used  transform  is  the  discrete  cosine  transform  (DOT).  After 
the  block  transform,  the  transform  coefficients  undergo  a  quantization  step.  At  low  bit-rates,  the 
DCT  coefficients  are  coarsely  quantized.  This  coarse  quantization  with  independent  quantization 
of  neighboring  blocks  gives  rise  to  blocking  artifacts — visible  block  boundaries. 

Several  blocking  artifact  reduction  methods  have  been  proposed  in  the  literature.  Spatial  filtering, 
iterative  reconstruction  techniques,  and  stochastic  reconstruction  techniques  are  among  the  blocking 
artifact  reduction  methods  that  have  been  proposed  for  still  images.  Although  temporal  information 
adds  another  dimension  for  video  sequences,  none  of  the  prior  art  used  this  information  effectively 
in  blocking  artifact  reduction  in  video. 

In  this  work,  we  develop  a  multi-frame  restoration-based  method  that  makes  use  of  the  spatial  cor¬ 
relations  between  successive  frames  effectively.  The  proposed  method  constructs  convex  constraint 
sets  at  each  frame  within  a  neighborhood  of  the  frame  of  interest,  using  the  motion  between  the 
frames  and  the  quantization  information  extracted  from  the  bit  stream.  The  method  is  based  on  the 
fact  that,  although  the  exact  value  of  the  quantization  noise  added  to  each  DCT  coefficient  is  not 
known,  the  range  within  which  it  lies  can  be  determined  from  the  bit  stream.  Incorporating  the 
motion  between  the  frames,  we  can  define  constraint  sets  not  only  at  the  current  frame,  but  also 
at  each  frame  within  a  small  neighborhood  of  the  current  frame.  By  projecting  the  initial  “blocky” 
frame  onto  these  constraint  sets  successively,  we  can  reconstruct  a  better  estimate  of  the  “original” 
frame — the  one  before  the  quantization  step. 

In  order  to  derive  the  constraint  sets,  we  first  need  to  establish  the  relation  between  neighboring 
frames.  The  key  is  to  use  the  intensity  conservation  assumption  along  the  motion  trajectories. 
Let  /(x,t)  denote  the  intensity  of  the  continuous  spatio-temporal  video  signal  at  spatial  coordinate 
x  =  [xi,  X2]  at  time  t.  Pixel  intensities  of  any  two  video  frames  can  be  related  to  each  other  through 
the  motion  vectors.  Denoting  M  =  [M\ (x,  4;  tj),  M^x.,  4;  tj )]  as  the  motion  mapping  between  the 
frames  at  times  4  and  tj,  we  can  write: 

/(x,4)  =  /(x  -  M,tj).  (1) 

We  now  proceed  by  relating  the  spatially  continuous  video  frame  at  time  tj  to  the  corresponding 
discrete  frame.  Letting  /( n,  j)  denote  the  intensity  of  the  jth  discrete  frame  at  the  integer  coordinate 
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n  =  [m ,  TI2] ,  we  can  write  the  spatially-continuous  reconstruction  as: 


*  hr  (x) 


=  p/(nj)<$(x-n) 

=  f  '52f(n,j)8(t-n)hr(-x-£)d£ 

J  n 

=  ^hr(x-n)/(n,j), 


(2) 


where  hr(x)  is  the  reconstruction  filter.  Substituting  Equation  (2)  into  Equation  (1),  and  discretiz¬ 
ing  kth  frame,  we  get: 


f(m,k)  =  ^  hr(m  —  M  —  n)f(n,j). 


(3) 


Now  we  model  the  operations  that  take  place  in  the  process  of  MPEG  compression  ( i.e .,  motion 
compensation,  block-DCT  calculation,  and  quantization).  Motion  compensation  is  simply  the 
subtraction  of  an  offset  value  f(m,k)  (predicted  image)  from  /(m,  k).  The  residual  image  is 
then  passed  through  a  series  of  8  x  8  block-DCTs  to  result  in  DCT  coefficients.  This  is  followed  by 
a  quantization  step  which  can  be  modeled  as  an  addition  of  quantization  error.  Using  Equation  3, 
the  overall  relation  can  be  written  as: 

dq(m,k )  =  ^2hK(m,k-,n,j)f(n,j)  -  F(m,k)  +  Q(m,k),  (4) 

n 

where  dq( m,  k )  is  the  quantized  DCT  coefficients,  k ;  n,  j )  is  the  block-DCT  of  hr  (m  -  M  -  n), 

F(m,  k)  is  the  block-DCT  of  /(m,  &),  and  Q(m,  k)  is  the  quantization  error. 


Although  the  exact  value  of  Q(m,  k)  is  not  known,  the  range  within  which  the  DCT  coefficient 
lies  can  be  extracted  from  the  MPEG  bit  stream.  Based  on  this  fact  we  define  constraint  sets 
C(m,  k)  on  frame  Defining  fy(m,  k)  and  bu(m,  k)  as  the  lower  and  upper  bounds  of  the 

DCT  coefficient  at  spatio-temporal  location  (m ,fc),  C(m,  k)  can  be  written  as: 


C(m,k) 


/(m,j) 


Y^hK(m,k;n,j)f(n,j)  -  F(m,  k) 
-  n 


€  [ bi(m,k),bu(m,k )] 


(5) 


This  equation  shows  how  to  define  constraint  sets  on  any  frame  j  using  the  quantization  information 
from  another  frame  k.  By  projecting  the  “blocky”  frame  onto  these  constraint  sets,  the  blocking 
artifacts  can  be  reduced  significantly. 


This  constraint  set  is  defined  simply  by  using  the  quantization  bounds  [fy(m),  6u(m)]  of  one  of  the 
DCT  coefficients  as  explained  in  the  previous  section.  Projection  of  an  image  onto  this  constraint 
set  amounts  to  simply  finding  the  closest  point  on  C(m,  k). 

Although  this  projection  is  likely  to  reduce  the  blocking  artifacts,  it  does  not  guarantee  a  significant 
improvement  since  the  “original”  blocking-artifact-free  image  could  be  anywhere  in  the  shaded 
region.  Defining  another  constraint  set  could  improve  the  quality  significantly.  As  depicted  in 
Figure  43,  the  second  constraint  set  defined  with  help  of  the  neighboring  frame  ( k  +  1)  reduces 
the  region  where  the  “original”  image  lies.  Projecting  the  initial  frame  onto  these  convex  sets 
successively  produces  a  better  result.  By  using  additional  frames  we  can  impose  more  constraint 
sets  onto  the  reconstructed  frame  and  reduce  the  blocking  artifacts  further. 
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4.2  DCT-Domain  Bayesian  Superresolution  Reconstruction  for  MPEG  Video 


With  a  Bayesian  estimator,  not  only  the  source  statistics,  but  also  various  regularizing  constraints 
can  be  incorporated  into  the  solution.  Bayesian  estimators  have  been  frequently  used  for  super¬ 
resolution.  However,  in  these  approaches  either  the  video  source  is  assumed  to  be  available  in 
uncompressed  form,  or  it  is  simply  decompressed  without  considering  the  quantization  process. 
Additive  noise  is  considered  as  the  only  source  of  error.  On  the  other  hand,  the  projection  onto 
convex  sets  (POCS)  techniques  treat  the  quantization  error  as  the  only  source  of  error  without 
considering  the  additive  noise.  Clearly,  neither  of  these  approaches  provides  a  complete  framework 
for  superresolution.  The  model  that  we  have  developed  combines  the  quantization  process  and  the 
additive  noise  under  a  Bayesian  framework. 


We  begin  by  analyzing  the  block  diagram  in  Figure  1  depicting  the  video  acquisition  process. 
According  to  this  model,  a  spatially  and  temporally  continuous  high-resolution  input  signal  /(x,  t ) 
is  affected  by  sensor  and  optical  blurs.  Sensor  blur  is  caused  by  integrating  over  the  finite  nonzero 
sensor  cell  area,  while  optical  blur  is  caused  by  the  lens  system.  The  blurred  video  signal  is  also 
integrated  over  time  to  capture  nonzero  time-aperture  effects.  After  sampling  on  a  low-resolution 
grid,  discrete  low-resolution  frames  </<*(!,  k)  are  obtained. 


We  now  add  the  MPEG  compression  stages  to  this  model.  As  shown  in  Figure  2,  the  LR  frame 
9d{  1)^)  is  motion  compensated  (i.e.,  the  prediction  frame  is  computed  and  subtracted  from  the 
original  to  get  the  residual  image),  and  the  residual  is  transformed  using  a  series  of  8  x  8  block- 
DCTs  to  produce  the  DCT  coefficients  d(m,k).  The  DCT  coefficients  d(m,k)  are  then  quantized 
to  produce  the  quantized  DCT  coefficients  d( m,  k). 


In  the  maximum  a  posteriori  probability  (MAP)  formulation,  the  quantized  DCT  coefficients, 
d(m,A;),  the  original  high-resolution  frame,  /(n,fr),  and  the  block  DCT  of  the  additive  noise 
are  all  assumed  to  be  random  processes.  Denoting  P/(n>ir)|j(m)fcl))...!j(mifcp)(0  as  the  conditional 
probability  density  function  (PDF)  with  Asi ,  ♦  •  -  ,  kp  being  the  frames  used  in  the  reconstruction, 
the  MAP  estimate  /( n,  tr )  is  given  by: 


/(n,tr) 


arg  max 

/(n,ir) 


,d(rn:kp) 


V  n. 


(6) 


We  used  the  underline  notation  (in  n  and  m)  to  emphasize  that  this  PDF  is  the  joint  PDF,  not 
the  PDF  at  a  specific  location. 


Using  Bayes’  rule,  Equation  6  can  be  rewritten  as: 

f(n,tr)  =  argm «{p^)f...f^)1/(atp)(0P/(B,tr)(0}  >  V  n, 


(7) 


where  we  used  the  fact  that  ,d(m,kp)(')  is  not  a  function  of  /(n, tr).  Clearly  we  need 

to  model  the  conditional  PDF  %(m1fe1),’-,d(m1fcP)|/(n,tr)(-)  and  the  prior  PDF  *>/(„, tr)(-)  in  order 

to  find  the  MAP  estimate  f(n,tr).  If  the  additive  noise  v(m,  k)  is  assumed  to  be  an  indepen¬ 
dent,  identically  distributed  (IID)  Gaussian  process,  it  is  possible  to  derive  the  conditional  PDF 
Pd(m,ki),-~ ,d{m,kp)\f(n,t) (')>  analytically,  which  can  be  used  with  any  prior  image  model.  The  resulting 
estimation  problem  can  be  implemented  using  an  iterated  conditional  modes  (ICM)  scheme.  (The 
details  of  the  derivation  and  the  implementation  are  presented  in  the  attached  paper.) 
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To  demonstrate  the  efficacy  of  the  proposed  method  we  performed  a  set  of  experiments.  In  these 
experiments,  the  test  images  Aerial  (Figure  3)  and  Watch  (Figure  7)  were  jittered,  blurred, 
downsampled,  and  corrupted  by  additive  noise  to  produce  multiple  low-resolution  frames.  For 
blurring,  the  images  were  convolved  with  a  5  x  5  Gaussian  kernel  with  a  standard  deviation  of 
1.5  pixels.  The  additive  noise  had  a  Gaussian  distribution  with  a  standard  deviation  1.4.  The 
aperture  time  was  taken  to  be  zero,  and  the  lens  point-spread  function  was  assumed  to  be  space 
invariant.  The  low-resolution  frames  were  then  compressed  using  an  MPEG  encoder  operating  at 
1.5Mbits/sec  to  produce  the  low  resolution  video  sequence.  Figure  4  and  Figure  8  show  one  of 
those  low  resolution  frames  from  each  of  the  sequences. 

The  first  decoded  frame  for  each  video  sequence  was  bilinearly  interpolated.  These  images  are 
depicted  in  Figures  5  and  9.  We  then  applied  our  compressed-domain  resolution  enhancement 
method  to  upsample  the  first  frames  for  both  sequences.  Figures  6  and  10  illustrate  the  results 
obtained  by  the  proposed  algorithm. 

i 
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I  ; 

Noise  | 
v(U)  i 


'  o  - 

4 


gA  U) 


/(M) 


Video  acquisition  process 


Figure  1:  Video  acquisition  model. 


/CM) 


Figure  2:  MPEG  compression  is  appended  to  the  video  acquisition. 


4.3  Fast  Motion  Estimation  Using  Low-bit-resolution  Images 

Reducing  the  temporal  redundancy  in  image  sequences,  motion  estimation  is  widely  used  in  the 
video  processing  algorithms.  Among  different  motion  estimation  techniques,  block-matching  al¬ 
gorithm  (BMA)  is  the  most  popular  one  due  to  its  high  performance  and  low  hardware  cost. 
Block-matching  algorithms  find  the  displacement  (motion  vector)  minimizing  the  matching  dif¬ 
ference  between  the  reference  block  and  the  candidate  blocks.  When  the  search  range  is  large, 
BMAs  become  computationally  costly,  which  is  a  significant  problem  for  real-time  video  processing 
applications.  There  are  various  fast  motion  estimation  approaches  proposed  to  reduce  that  compu¬ 
tational  cost.  Unimodal  error  surface  assumption  techniques,  multiresolution  techniques,  integral 
projection  techniques,  variable  search  range  techniques,  lower  bit-resolution  techniques  are  among 
them. 

The  focus  of  this  report  will  be  the  low  bit-resolution  BMAs.  In  an  early  work,  Gharavi  and  Mills 
[1]  proposed  a  fast  BMA  based  on  one-bit  quantization  of  the  pixels.  Natarajan  et  al.  [2]  made  that 
algorithm  even  faster  by  using  exclusive-OR  (XOR)  operation  for  the  block-matching  criterion. 
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*  A*  **'• 


Figure  4:  A  frame  from  the  synthetically  generated  video  sequence 


Although  the  XOR  based  block-matching  criteria  reduce  the  computational  load  significantly  for 
one-bit  resolution  algorithms,  the  accuracy  of  the  motion  estimation  is  a  big  concern.  Back  et  al. 


T 


Figure  5:  Bilinear  interpolation. 


[3]  compared  the  performances  of  different  bit-resolution  images  using  reduced-bit  mean  absolute 
difference  as  the  matching  criterion.  Lee  et  al.  [4]  presented  a  two-bit  adaptive  quantization  scheme 
to  increase  the  accuracy  of  the  motion  estimation.  They  also  introduced  a  new  block-matching 
criterion,  Different  Pixel  Count  (DPC),  providing  less  hardware  cost  than  the  other  reduced-bit 
matching  criteria  such  as  Pel  Difference  Classification  [1]  and  Bit  Truncation  [3].  But  all  these 
block-matching  criteria  do  not  still  come  close  to  the  efficiency  of  the  simple  XOR  based  methods. 

In  this  work,  we  propose  a  new  two-bit  BMA  that  uses  XOR  operation  as  the  correlation  engine. 
Three-level  quantization  scheme  is  the  key  in  this  algorithm,  enabling  the  use  of  XOR  operation. 
The  proposed  algorithm  is  more  accurate  than  the  one-bit  methods. 


4.3.1  Three-level  quantization  of  images 

The  block  diagram  of  the  proposed  algorithm  is  given  in  Figure  11.  The  input  images  are  first 
band-pass  filtered  and  then  quantized  to  three  levels  (Q3).  The  band-pass  filter  (BPF)  is  designed 
to  remove  the  DC  so  that  the  overall  method  is  less  susceptible  to  the  errors  due  to  brightness 
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Figure  6:  Proposed  algorithm  (8  frames  used). 


changes,  and  also  remove  the  high-frequency  noise.  By  quantizing  the  images  to  three  levels,  the 
accuracy  of  the  motion  estimation  is  increased  in  comparison  to  the  two-level  quantization  methods 
while  providing  a  fast  implementation  scheme  based  on  XOR  operation  at  the  same  time. 

For  the  three-level  quantization,  the  levels  are  represented  by  the  two  bit  pairs:  10,  00  or  01,  corre¬ 
sponding  to  the  pixel  values  less  than  Tl,  between  T1  and  T2,  and  greater  than  T2,  respectively, 
where  Tl  and  T2  are  the  thresholds.  Upon  applying  the  XOR  we  have  the  following  combinations 
and  results: 


00  XOR  00  -  00,  10  XOR  10  =  00,  01  XOR  01  =  00 
00  XOR  01  =  01,  00  XOR  10  =  10,  10  XOR  01  =  11 


From  these  sets  of  possible  comparisons,  we  see  that  the  number  of  1  bits  in  the  result  can  be  a 
measure  of  correlation.  It  gives  zero  1  bits  if  the  pixels  are  highly  correlated  (00  XOR  00,  10  XOR 
10,  etc.),  two  1  bits  if  the  correlation  is  the  lowest  (10  XOR  01),  and  one  1  bit  in  between  (00  XOR 
01,  00  XOR  10).  Therefore,  the  algorithm  will  count  the  number  of  1  bits  in  the  results  of  the  XOR 
operation  between  two  blocks,  and  choose  the  block  with  the  least  number  of  1  bits  as  the  matched 
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Figure  9:  Bilinear  interpolation. 


4.3.2  Band-pass  filter  implementation 


The  band-pass  filter  operation  is  designed  to  achieve  a  fast  and  effective  implementation.  As 
depicted  in  Figure  12,  the  image  is  filtered  by  a  3  x  3  window  of  ones,  and  a  7  x  7  window  of  ones. 
The  difference  of  these  two  filtered  images  gives  the  band-pass  filtered  image.  With  this  kind  of 
scheme,  the  filtering  operation  becomes  extremely  efficient:  Without  any  multiplication,  the  pixels 
under  the  3x3  and  7x7  windows  are  summed  up,  and  then  scaled  once  at  the  end.  This  scaling 
can  also  be  simplified  if  the  the  window  sizes  are  chosen  to  be  factors  of  two,  in  which  case  the 
pixel  values  are  scaled  by  bitwise  shifting  to  the  right. 


4.3.3  Threshold  selection 

Since  the  overall  method  is  aimed  to  be  fast,  the  threshold  selection  technique  is  chosen  to  have  low 
computational  complexity.  The  simplest  way  of  choosing  a  threshold  is  to  choose  a  fixed  threshold, 
that  can  be  implemented  by  bit  truncation.  However,  it  is  obvious  that  with  this  kind  of  threshold 
selection,  the  motion  vectors  cannot  be  computed  accurately.  Another  simple  method  is  to  find  the 
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Figure  10:  Proposed  algorithm  (8  frames  used). 


mean  of  the  image,  and  then  use  a  fixed  offset  value  from  the  mean  to  determine  T1  and  T2.  In 
other  words,  a  fixed  number  is  subtracted  from  the  mean  of  the  image  to  determine  Tl,  and  added 
to  the  mean  to  determine  T2.  We  have  performed  experiments  for  different  video  sequences  to 
determine  an  optimum  threshold  value.  Figure  13  is  a  graph  showing  the  average  PSNR  values  for 
different  quantization  levels  and  band-pass  filters,  as  a  function  of  threshold  values.  In  that  figure, 
Filter  3-5  indicates  that  windows  of  3  x  3  and  5x5  are  used  in  the  band-pass  filter  implementation. 
Similarly,  Filter  3-7  indicates  that  windows  of  3  x  3  and  7x7  are  used.  An  important  observation 
in  the  figure  is  that  three-level  quantization  has  better  performance  than  two-level  quantization  at 
small  threshold  values.  However,  as  the  threshold  value  increases,  the  performance  of  the  three-level 
quantization  becomes  worse  than  two-level  quantization.  Although  this  is  the  plot  for  a  particular 
sequence  ( “Foreman”  sequence) ,  the  same  behaviour  has  been  been  observed  for  all  video  sequences 
tested.  In  our  experiments,  we  have  found  that  an  offset  value  of  one  works  sufficiently  for  the 
bandpass  filter  Filter  3-7,  mentioned  above. 
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4.3.4  Correlation  surface  search 


The  most  important  element  in  this  scheme  is  the  Packed  Correlation  Surface  Search.  As  explained 
before,  the  correlation  operation  of  an  image  region  taken  against  another  image  region  at  some 
shift  can  be  computed  utilizing  the  XOR  operation  on  packed  words.  For  example,  if  a  given 
processor  has  32-bit  data  words,  16  pixels  from  one  image  line  can  be  correlated  against  16  pixels 
from  another  image  line  utilizing  only  a  simple  XOR  operation.  The  correlation  measure  is  simply 
the  sum  of  the  number  of  1  bits  in  the  resultant  32-bit  word.  By  segmenting  each  line  of  the  selected 
block  into  16  pixel  chunks,  the  correlation  over  the  entire  block,  at  a  given  shift,  can  be  computed 
with  a  minimum  number  of  operations.  If  two  bits  are  used  to  represent  each  pixel,  then  two  32-bit 
data  words  can  be  used  to  compute  the  correlation  for  16  pixels. 


4.3.5  Generalization  to  higher-level  quantization 


It  should  be  noted  that  in  two-bit  per  pixel  representation  there  are  actually  four  possible  quanti¬ 
zation  levels.  However,  the  only  way  to  use  the  sum  of  the  number  of  1  bits  in  the  XORed  registers 
as  a  distance  metric  is  to  use  three  different  quantization  levels.  This  can  be  generalized  to  higher 
levels  as  well.  For  instance,  when  the  pixels  are  represented  by  three  bits,  there  are  four  quanti¬ 
zation  levels  possible  if  we  want  to  use  the  sum  of  the  number  of  1  bits  as  the  distance  metric. 
These  levels  can  be  represented  with  000,  001,  Oil,  and  111.  In  this  case,  the  distance  between  two 
intensities  can  be  0,  1,  2,  or  3. 


4.3.6  Experimental  results 


We  have  tested  the  proposed  algorithm  for  various  video  sequences,  including  the  standard  “Susie” , 
“Foreman” ,  “Coast  Guard” ,  “Tennis” ,  and  “Flower  Garden”  sequences.  The  results  are  tabulated 
in  Table  1.  The  size  of  each  frame  for  the  “Foreman”  sequence  is  288  x  352,  for  the  others  it  is 
240  x  352.  We  have  used  two-level  and  three-level  quantization  schemes,  and  compared  it  with 
the  conventional  8-bit  algorithm.  The  motion  vectors  are  computed  for  every  8  x  8  or  16  x  16 
blocks,  and  the  numbers  in  parenthesis  show  this  periodicity.  As  seen  on  the  table,  two-  and 
three-level  quantization  algorithms  have  significantly  lower  search  time  than  the  standard  8-bit 
algorithm  at  the  cost  of  0.5  dB  loss  in  PSNR.  Three-level  quantization  is  0.2  dB  better  than  the 
two-level  quantization  on  the  average.  This  0.2  dB  is  a  critical  amount  considering  the  fact  that 
8-bit  algorithm  is  0.5  dB  better  than  reduced-bit  algorithms  on  the  average. 

The  search  time  results  were  obtained  on  a  Pentium  III  900MHz  processor.  We  can  also  have  an 
approximate  picture  of  the  computational  complexities  in  terms  of  the  number  of  operations  per 
pixel.  For  the  filtering  operation,  there  are  59  flops  (addition  or  multiplication)  required  for  each 
pixel.  (9  for  the  3x3  window  filter,  49  for  the  7x7  filter,  and  1  for  the  final  subtraction.)  For 
the  three-level  quantization  operation,  less  than  1  flop  (addition  or  multiplication)  is  required  per 
pixel.  And  for  the  correlation  search,  only  65  flops  per  pixel  is  necessary  for  a  search  range  of  [-8,  7] 
pixels  (full  search).  The  total  number  of  flops  per  pixel  is  only  125.  This  is  a  significant  reduction 
when  compared  to  the  conventional  8-bit  full  search  method,  which  requires  more  than  2700  flops 
per  pixel. 


15 


Image 

Sequence 


Motion 

Vectors 


Figure  11:  Block  diagram  of  the  proposed  BMA 
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Figure  12:  Computation  of  the  band-pass  filtered  image 


Foreman 


Figure  13:  Threshold  selection 


4.4  Multi-Frame  Gray-Scale  Resolution  Enhancement 

When  images  are  digitized,  a  certain  number  of  bits  is  assigned  to  each  pixel  to  represent  its 
intensity.  The  number  of  bits,  the  bit  depth,  determines  the  number  of  gray  levels  between  the 
minimum  and  maximum  values  that  the  imaging  device  can  capture.  There  will  be  a  loss  of  gray¬ 
scale  resolution  if  the  bit  depth  is  not  sufficient.  When  a  set  of  low  bit-depth  images  is  available 


16 


Table  1:  Average  PSNR  and  Search  Time  Comparison. 


Video  Sequence  | 

Susie 

Foreman 

Coast  Guard 

Tennis 

Flower 

(240x352) 

(288x352) 

(240x352) 

(240x352) 

(240x352) 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

PSNR 

Time 

Method 

(dB) 

(sec) 

(dB) 

(sec) 

(dB) 

(sec) 

(dB) 

(sec) 

(dB) 

(sec) 

2  Level  (8) 

36.0247 

0.469 

32.8750 

0.562 

27.5283 

0.469 

30.6896 

0.469 

25.1050 

0.469 

3  Level  (8) 

36.0165 

0.719 

32.9885 

0.875 

27.6164 

0.719 

31.0983 

0.719 

25.2322 

0.719 

2  Level  (16) 

35.8953 

0.250 

31.5188 

0.312 

27.4410 

0.250 

29.6036 

0.250 

24.3092 

0.250 

3  Level  (16) 

35.7577 

0.313 

31.7258 

0.375 

27.5311 

0.313 

30.1381 

0.313 

24.6414 

0.313 

8-bit  (16) 

36.1092 

47.1 

32.8216 

47.1 

27.9828 

47.1 

31.4615 

47.1 

25.9193 

47.1 

that  are  slightly  different  from  each  other  because  of  motion  or  illumination,  their  non-redundant 
information  can  be  combined  to  enhance  the  gray-scale  resolution.  We  refer  to  this  process  of 
multi-frame  gray-scale  resolution  enhancement  as  superprecision. 

Superprecision  can  be  applied  in  several  application  areas.  One  of  the  most  important  of  these  is  in 
medical  imaging.  With  these  images  low-contrast  details  are  often  extremely  critical  for  diagnosis, 
but  when  the  bit  depth  is  insufficient,  these  details  may  be  lost.  Superprecision  reconstruction  has 
the  potential  to  regain  these  details  by  combining  the  non-redundant  information  that  is  present  in 
a  set  of  images.  Military  automatic  target  detection,  aerial  and  satellite  remote  sensing,  and  high- 
quality  scanning  applications  can  also  use  make  effective  use  of  superprecision  reconstruction  to 
enhance  gray-scale  resolution.  The  conversion  of  images  from  a  low  bit-depth  format  to  a  higher  one 
(e.g.,  conversion  from  8-bit  GIF  to  24-bit  JPEG)  is  also  a  potential  application  of  this  technology. 

The  superprecision  problem  is  very  similar  to  the  superresolution  problem  in  which  higher  spatial 
resolution  is  sought  from  a  set  of  low  spatial  resolution  images  [63].  Although  the  superresolution 
problem  has  received  a  considerable  amount  of  attention,  the  superprecision  has  not  been  as  actively 
researched.  In  an  early  paper,  Cheeseman  et  al.  [16]  proposed  increasing  both  the  spatial  and  gray¬ 
scale  resolution  at  the  same  time  by  using  a  maximum  a  posteriori  probability  estimator.  They 
assumed  a  Gaussian  model  for  all  of  the  probability  distributions  and  used  Jacobi’s  method  to  solve 
the  problem  iteratively.  In  this  paper,  we  propose  a  deterministic  method  based  on  a  projections 
onto  convex  sets  (POCS)  technique  that  does  not  make  any  assumptions  about  the  underlying 
probability  densities.  Using  this  method,  an  initial  high-resolution  image  estimate  is  projected 
onto  constraint  sets  that  are  derived  from  the  low  resolution  gray-scale  image  observations.  The 
method  can  work  either  in  the  spatial  domain  or  in  the  transform  domain,  where  it  is  possible  to 
include  details  of  the  compression  process.  In  some  cases  superprecision  and  superresolution  can 
be  achieved  at  the  same  time. 

Section  2  presents  the  imaging  model  to  be  used  in  the  reconstruction.  The  POCS-based  super¬ 
precision  approach  is  presented  in  Section  3,  and  Section  4  gives  the  final  algorithm.  Section  5 
presents  some  experimental  results,  and  Section  6  concludes  the  paper. 
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Figure  14:  A  model  for  the  production  of  a  low-resolution  image  from  a  high-precision,  high- 
resolution  image  by  the  image  recording  process. 

4.4.1  The  Imaging  Model 

This  section  presents  a  model  that  establishes  the  connection  between  a  high-resolution  source 
image  and  multiple  low  bit-depth  image  observations.  This  connection  will  be  used  in  the  next 
section  to  enhance  the  gray-scale  resolution  of  the  images. 

As  seen  in  Figure  39,  the  model  has  two  components.  The  first  models  the  image  capture  process.  A 
high-resolution  image,  f(Nl^  (m,ri2),  is  captured  by  an  imaging  device  to  produce  the  low  resolution 
images,  g(-Nl\mi,m2,  k).  Here  the  superscript  N\  represents  the  number  of  bits  used  to  represent 
each  pixel  and  (ni,  712)  and  (mi,  m2,  k )  are  the  spatial  pixel  coordinates  of  the  high-resolution  image 
and  the  kth  low-resolution  image,  respectively.  The  image  capture  process  is  a  linear,  shift-varying 
(LSV)  operation  that  includes  motion  (of  the  camera  or  the  objects  in  the  scene),  blur  (because  of 
the  nonzero  sensor  aperture  time,  the  nonzero  physical  dimensions  of  the  individual  sensor  elements, 
the  degree  of  focus,  etc.),  and  sampling  with  a  low-resolution  grid  [42].  In  this  paper,  we  model  all 
of  these  effects  except  for  the  sensor  aperture  time,  which  is  taken  to  be  zero.  According  to  this 
model,  the  mapping  from  a  high-resolution  image  to  a  low  spatial-resolution  image  is  expressed  as 
a  weighted  sum  of  the  high-resolution  image  pixels,  where  the  weights  are  the  values  of  a  space- 
invariant  point-spread  function  (PSF)  at  the  corresponding  pixel  locations.  The  center  of  the  PSF 
depends  upon  the  motion  between  the  high-resolution  image  and  the  low-resolution  images.  This  is 
depicted  in  Figure  15.  Motion  vectors  from  each  low  spatial-resolution  image  to  the  high-resolution 
image  determine  how  each  pixel  is  mapped.  The  normalized  PSF  that  characterizes  the  camera 
is  centered  at  that  location,  and  from  it  the  weights  on  the  high-resolution  image  grid  are  found. 
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(8) 


Defining  h(mi,m2,fc;rai,n2)  as  this  mapping,  we  can  model  the  image  capture  process  by: 

k)  =  E  h{mi,  m2,  &;  m,  n2)/(Ari)(m,  n2). 

ni,n2 

Equation  8  provides  the  relation  between  a  single  high-resolution  image  and  the  set  of  low  spatial- 
resolution  images.  Further  details  concerning  this  modeling  can  be  found  in  [46]. 


g(N‘\mi,m2,k+ 1) 

Figure  15:  A  mapping  of  the  high  resolution  point-spread  function  to  the  lower  resolution  sampling 
lattices. 

The  second  sub-block  models  the  gray-scale  resolution  reduction,  which  is  nothing  but  a  reduction 
in  the  number  of  bits  used  to  represent  each  pixel.  Bit  depth  reduction  from  Ni  bits  to  AT2  bits  is 
equivalent  to  the  operation: 

g('N2\mi,m2,k)  =  round  (mi  *)}.  0) 

where  (mi,  m2,  k )  corresponds  to  the  low-resolution  (both  in  gray-scale  and  in  the  spatial  vari¬ 
ables)  image  observations.  The  factor  2N2~Nl  scales  the  image  to  the  new  range,  and  the  rounding 
operation  round{-},  which  rounds  the  argument  to  the  nearest  integer,  effects  the  quantization. 

Letting  6 (mi, m2,  k )  denote  the  error  introduced  by  rounding,  we  can  rewrite  Equation  9  as 

g<'N2\mi,m2,k)  =  2N2~Nlg(~Nl'>(mi,m2,  k)  +  6(mi,m2,  k),  (10) 

where 

|<S(mi,m2,  k)\  <  0.5.  (11) 


If  the  images  are  transform  coded  (using  JPEG  or  MPEG  for  instance),  a  similar  relation  can  be 
formulated  in  the  transform  domain.  The  most  common  transform  is  the  discrete  cosine  transform 
(DOT)  applied  on  8  x  8  blocks.  Taking  an  8  x  8  block-DCT  of  Equation  10,  we  get: 

G(N2Xh,l2,k)  =  2N2-N'G(N'\li,l2,k)  +  A(li,l2,k),  (12) 
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where  G^2)  (fa,  fa ,k),  G^Nl\fa,  fa ,k),  and  A(Zi ,  fa,  k)  are  the  DCTs  of  g^N^ (mi,  m2,  k),  g^Nl^ (mi,  m2,  k), 
and  S(fa,fa,k),  respectively.  A(fa,fa,k)  is  also  bounded  in  magnitude  since  6(fa,fa,k)  is  bounded, 
but  its  bounds  depend  on  the  position  (fa,  fa)-  They  are  derived  in  the  Appendix.  (It  should 
be  noted  that  the  bounds  of  A(fa,fa,k)  are  not  equal  to  the  DCT  of  the  bounds  of  S(fa,fa,k).) 
Applying  the  block-DCT  to  Equation  8,  we  can  write  G^Nl'(fa,fa,k)  as: 

G(-Nl\fa,fa,k)  =  y]  hpcT(h,fa,k;ni,n2)f(Nl\ni,n2),  (13) 

m,n2 

where  hpcTih,  fa- k\ n\,  7/2)  is  the  block-DCT  of  h(m\,m2,  k'.  rq  ,  nfa-  Mathematically,  the  relation 
can  be  written  as: 


L(h)+7  L(h)+ 7 

hDCT(h,fa,k-,ni,n2)  =  ^  ^  K  (fa,fa]  (mi)8 ,  (m2)8)  h(mi,m2,k\n\,n2),  (14) 

mi=L(li)  rri2=L(l2) 


where  (-)8  is  the  modulo  8  operator,  L(x)  =  / 8 J ,  and  K(fa,fa;mi,m2)  is  the  DCT  kernel  given 

by: 

isn  1  \  1  1  { (2mi  +  l)fair\  ( (2m2  +  l)fan\ 

K(l1,fa-,m1,m2)  =  khkh  cos  I  - — -  1  cos  I - - J  •  (15) 

kit  is  the  normalization  factor 


1 

4V2 


1 

4 


0 


h  =  0 


z  =  1,2. 


(16) 


To  summarize  we  have  two  equations  that  relate  the  HR  image  to  LR  images  and  DCT  coefficients: 

g<'Ni)(mi,m2,k)  =  2N2~Nl  ^  h(mi,m2,k-,ni,n2)f^Nl)(ni,n2)  +  8(mi,m2,k),  (17) 

Til, 712 

and 

G{N2\h,fa,k)  -  2N2~Nl  y  hDCT(fa,fa,k\ni,n2)f^Nl\ni,n2)  +  A(fa,fa,k).  (18) 

n  l>«2 

These  two  equations  will  be  used  in  the  POCS-based  superprecision  estimation  algorithms.  Equa¬ 
tion  18  will  also  be  extended  to  include  the  compression  process  where  the  DCT  coefficients  are 
quantized  according  to  quantization  tables. 


4.4.2  Superprecision  Methods 

This  section  formulates  POCS-based  superprecision  in  the  spatial  and  transform  domains.  The 
transform-domain  formulation  also  includes  the  quantization  process  that  is  common  in  image  and 
video  compression  standards  such  as  JPEG,  MPEG,  and  H.263. 

Spatial-Domain  POCS  Solution 

From  Equation  17,  it  is  seen  that  the  value  of  2N'2  ~ Nl  gNl  \mi,  m2 ,  k )  falls  within  the  0.5  proximity 
of  the  observed  pixel  intensity  g(N2\mi,m2,k).  This  information  can  be  used  to  define  a  convex 
constraint  set  for  each  observed  pixel.  The  method  follows. 
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The  image  capture  process  shown  in  Figure  39  is  applied  on  an  initial  high-resolution  image  estimate 
x^Nl^ (711,77-2),  and  then  scaled  by  the  factor  2N2~Nl.  The  result  is  compared  to  the  observations 
g(N2^ (mi, m2,  k).  It  is  known  that  the  residual,  the  difference  between  the  computed  and  the 
observed  images,  must  be  less  than  0.5  in  magnitude  if  the  estimate  is  correct.  If  the  residual  is 
greater  than  0.5,  then  the  error  is  back-projected  onto  the  initial  estimate  so  that  the  next  time  the 
image  capture  model  and  scaling  are  applied,  it  will  fall  within  the  0.5  proximity  of  the  observations. 

Defining  the  residual  as: 

rx(mi,m2,k )  =  g(N2\mi,m2,k)- 

2N2-N1  ^  h(mi,m2,k-,ni,n2)x^Nl\ni,n2),  (19) 

m,n2 

we  can  write  the  convex  constraint  set  for  an  arbitrary  image  x^Nl\ni,n2)  as  follows: 

c(mi,m2,k)  =  {*(iVl)(«i>«2)  :  \rx(mi,m2,k)\  <  0.5}  .  (20) 


The  projection  operation  onto  these  convex  sets  is  given  by: 

-^(mi,ra2,fc)  1  ^  (^1 5  ^2 )] 

f  x<N'\nun2)  +  rx(mi,m2,k)  >  0.5 


E  |Mmi>m2,fc;raij«2)i 

nl  ,n2 


x^Nl^(ni,n2),  —  0.5  <  rx(mi,m2,k)  <  0.5 

X(N^(nUn2)  +  2^-Jy»(r«gn.m2,fc)+0.5)/>(m1  (  fc)  <  -0.5 

v  7  E  \h(mi  ,m2 ,  A;;ni  ,n2 )  |  v  77 


(21) 


n  1,712 


Compressed-Domain  POCS  Solution 

Equation  18  provides  the  relation  between  the  DCT  coefficients  G(N2\h,l2,k)  and  the  HR  image 
f(Nl\ni9ri2).  The  DCT  coefficients  G^N2\l\^k)  are  quantized  in  the  process  of  compression. 
Defining  &N2\l\,l 2,  k )  as  the  quantized  DCT  coefficients,  and  Q(l\:  h ,  fc)  as  the  quantization  error, 
we  can  write: 

&N*\h,l2,k)  =  G^(h,l2,k)  +  Q(h,l2,k),  (22) 

where  Q(l\,  l2,  k )  is  bounded  by  half  of  the  quantization  step  size  at  location  (h,l2).  Using  Equation 
18,  we  get 

G^\h,l2,k)  =  2^-^  hDCT(hj2,k\ni,n2)f(Nl\ni,n2)  +  A(h,l2,k)  +  Q(/i,/2,fe).  (23) 

ni,ri2 


The  sum  of  of  A(li,l2,k)  and  Q(h,l2,  k)  is  bounded  by  B(l\,l2,k),  which  is  equal  to  the  sum  of 
the  bounds  of  A(li,l2,k)  and  Q(h,l2,k).  That  is, 

\A(h,l2,k)  +  Q(h,l2,k)\<B{h,l2,k),  (24) 
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where 


L(l  i)+7  L(l2)+ 7 

B(lul2,k)  =0.5  X]  |^(h,/2;(mi)8,(m2)8)|  + 

mi=L(ii)  7712=1/(^2) 


Quantization  step  size  at  location  (Zi,Z2) 
2 


(25) 


The  first  term  on  the  right-hand  side  of  Equation  25  is  derived  in  the  Appendix.  Equation  23  along 
with  the  bound  B(l\,l2,k)  allow  us  to  write  a  POCS  reconstruction  algorithm  in  the  compressed 
domain  that  is  analogous  to  the  spatial-domain  algorithm  derived  in  the  previous  section. 


Now  we  can  follow  the  same  procedure  as  in  the  previous  section  to  create  the  projection  operator. 
Defining  the  compressed-domain  residual  as 


Rx(h,h,k)  =  &N*Xh,l2,k)- 

2N2-Ni  hDCT{hM,k\ni,n2)x^Nl\n1,n2),  (26) 

ni,n2 

we  can  write  the  convex  constraint  set  for  an  arbitrary  image  (711,712)  as  follows: 

C(h,h,k)  =  {z(JVl)(m,n2)  :  \Rx(h,h,k)\  <  B(h,l2,k)}  .  (27) 


The  projection  operation  onto  these  convex  sets  is  given  by: 
p(h,h,k)  [z(JVl)(m,n2)]  = 


(  x^Nl^(nx  no)  +  2iVl  (Ac (h>b, *:)-£(/ i,h,k))hpcT(h,h,k-,ni,n2) 

I  ’  '  £  I  hDCT  (i  1  ,<2  ,fe;ni  ,«2 )  1 3  ’ 

I  nl  »n2 


Rxihth,  k)  >  B(l\,  I2, k ) 


i  X^\nun2),  -B(h,l2,k)<Rx(h,l2,k)<B(l1,l2,k) 
x(NlHni  no)  +  2Nl~jV2(-Rx(h,b,fc)+^(b,b,fc))fcgc r(h.b,fc;ni,n2) 

’  '  Y,  \hDCT(ll,h,k-,ni,n2)\2  ’ 

V  nx,n2 


Rx{h,h,k )  <  -B(h,l2,k) 

(28) 


4.4.3  Implementation 

Implementations  of  both  the  spatial-  and  transform-domain  algorithms  are  very  similar.  In  the 
spatial-domain  implementation  the  error  back-projected  is  the  error  in  the  pixel  intensities.  In  the 
transform-domain  implementation  it  is  the  error  in  the  DCT  coefficients  that  is  back-projected.  An 
illustration  of  using  multiple  constraint  sets  to  restrict  the  solution  set  is  illustrated  in  Figure  16. 
The  final  algorithm  for  spatial-domain  (and  compressed-domain)  reconstruction  is  the  follows. 

1.  Choose  a  reference  frame  and  bilinearly  interpolate  it  to  get  an  initial  fine-grid  image. 

2.  Extend  the  bit  depth  of  this  image  to  the  required  bit  depth  by  multiplying  each  pixel  intensity 
by  2Nl~N2  and  filling  in  the  last  N\  —  N2  bits  randomly.  The  resulting  image,  ^(Wl)(m, n2), 
is  an  initial  estimate  for  the  high  resolution  image  /(Wl)(m,n2). 

3.  Compute  the  motion  between  the  reference  frame  and  one  of  the  low-bit-depth  images, 
g('N2'>{mi,m2,k). 
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4.  Using  the  motion  estimates,  compute  the  mapping  h(mi,ni2, k;ni,n2)  for  each  pixel  in  the 
current  image  g(-N2\mi,m2,k).  (Compute  hpcTihih,  k]  ni,  n2)  for  each  DCT  coefficient  in 
the  case  of  transform-domain  implementation) 

5.  For  each  pixel  (DCT  coefficient)  in  the  current  image, 

(a)  Compute  the  pixel  intensity  (DCT  coefficient  value)  from  the  estimate  x^Nl\n\,ri2)  by 
applying  the  image  capture  processing  model  and  scaling, 

(b)  Compute  the  residual  rx(mi,m2,k)  (Rx(h,h,  k))  and  back-project  the  error  to  the  es¬ 
timate  x(Nl\ni,ri2)  using  Equation  21  (Equation  26). 

6.  Stop,  if  a  stopping  criterion  is  reached;  otherwise,  choose  another  low-bit-depth  image,  and 
go  to  step  2. 


It  should  be  noted  that,  by  construction,  this  algorithm  has  the  potential  to  achieve  both  spatial 
and  gray-scale  resolution  enhancement  at  the  same  time.  If  the  high-resolution  image  estimate  has 
a  finer  grid  than  the  observations,  both  spatial  and  gray-scale  resolution  enhancement  are  achieved. 
If  they  have  the  same  sampling  grid,  only  gray-scale  resolution  enhancement  is  achieved. 


4.4.4  Experimental  Results 


The  high-resolution  image  Money,  shown  in  Figure  17  is  jittered,  blurred  by  a  Gaussian  kernel 
(with  a  support  of  5  x  5  and  variance  of  2.5),  and  downsampled  to  produce  eight  spatially  low- 
resolution  images.  Their  gray-scale  resolution  is  then  reduced  from  eight  bits  to  a  lower  number  of 
bits.  The  lower-precision  images  are  then  processed  by  the  proposed  algorithm  to  produce  high- 
quality  images  with  twice  the  spatial  resolution  and  eight  bit  gray-scale  resolution.  Figures  18 
and  20  show  four-  and  three-bit  Money  images.  The  reconstructed  images  are  shown  in  Figures 
19  and  21.  The  Money  images  are  also  MPEG-1  encoded  (using  eight  frames)  to  illustrate  the 
transform-domain  implementation.  Figures  22  and  23  depict  the  reconstructed  images. 

The  reconstructions  algorithm  is  also  applied  to  images  that  have  a  bit  depth  of  only  one  and  two 
bits.  Figures  24  through  27  show  these  cases. 

The  reconstructions  made  from  four-  and  three-bit  images  show  significant  improvement  in  both 
gray-scale  and  spatial  resolution.  Although  the  image  quality  for  the  reconstructed  images  from  a 
bit-depth  of  two  and  one  bits  are  not  satisfactory,  characters  that  are  not  readable  on  the  originals 
become  readable  after  the  reconstruction. 

In  the  experiments,  motion  is  computed  using  the  Hierarchical  Block  Matching  (HBM)  method 
of  Bierling  [141].  Three  hierarchical  levels  are  used  with  the  Mean  Absolute  Difference  (MAD)  as 
the  matching  criterion.  In  the  final  level,  motion  estimates  are  obtained  with  one-quarter  pixel 
accuracy. 
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4.4.5  Conclusion 


In  this  paper,  we  presented  a  superprecision  method  that  increases  the  gray-scale  resolution  by 
combining  the  non-redundant  information  from  a  set  of  low  bit-depth,  low  resolution  images.  It  is 
based  on  the  projections  onto  convex  sets  (POCS)  technique  where  the  convex  sets  are  defined  using 
the  bit-depth  reduction  information.  With  the  transform-domain  implementation,  the  quantization 
error  that  results  from  compression  can  also  be  included  in  the  reconstruction.  The  method  can  also 
increase  the  spatial  resolution  if  a  finer  sampling  grid  is  used  for  the  initial  high-resolution  image 
estimate.  Therefore,  the  proposed  method  can  be  considered  as  a  generalization  to  superresolution 
reconstruction. 


Appendix: 


The  error  5(rai,ra2,/c)  is  bounded  by  0.5.  When  the  image  is  transformed  to  the  DCT  domain, 
there  will  be  a  new  error  A(Zi,/2?&).  This  appendix  determines  the  bounds  that  are  appropriate 
for  A(Zi,Z2,  k). 


Following  from  Equations  10  and  12,  the  bound  max  |  A(Zi?  Z2,  fc)|  can  be  found  by: 


max|A(Zi,Z2>fc)| 


_  DCT  {2n*  Nlg(Nl'>(mi,m2,k)  +5(m1,m2,k)} 

-  DCT  {2JV2-JvigW)(mi,m2,  k)} 

—  max  \DCT  {<5(mi,  m2,  k)}\ 


(29) 


Using  the  definition  of  DCT{ •},  we  get: 

I  L(l i)+7 


max  |A(Zi,Z2,fc)|  =  max 


E 


L(h)+7 

E  K  (Zi,  Z2;  (mi)8  ,  (m2)8)  <5(mi,  m2,  k) 


=  0.5 


\mi=L(li)  m.2=L(l2) 
L(l  i)+7  L(l2)+ 7 


E  l^(/i^2;(m1)8,(m2)8)|. 


mi=L(h)  rri2=L(l2) 


(30) 
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Observed  pixel 
intensity 


Figure  16:  Illustration  of  how  the  POCS  algorithm  uses  multiple  coarse  quantization  values  to 
produce  a  finer  estimate  of  a  sample  value. 
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Figure  17:  Original  Money  image. 


4.5  Face  Recognition  from  Video 

The  performance  of  existing  face  recognition  systems  decreases  significantly  if  the  resolution  of  the 
face  image  falls  below  a  certain  level.  This  is  especially  critical  in  surveillance  imagery  where  often 
only  a  low-resolution  video  sequence  of  the  face  is  available.  If  these  low-resolution  images  are  passed 
to  a  face  recognition  system,  the  performance  is  usually  unacceptable.  Therefore,  super-resolution 
techniques  have  been  proposed  for  face  recognition  that  attempt  to  obtain  a  high-resolution  face 
image  by  combining  the  information  from  multiple  low-resolution  images  [15,  6,  39,  14].  In  general, 
super-resolution  algorithms  try  to  regularize  the  ill-posedness  of  the  problem  using  prior  knowledge 
about  the  solution,  such  as  smoothness  or  positivity  [21].  Recently,  researchers  have  proposed 
algorithms  that  attempt  to  use  model-based  constraints  in  regularization.  While  [15]  demonstrates 
how  super-resolution  (without  model-based  priors)  can  improve  the  face  recognition  rate,  [6],  [39], 
and  [14]  provide  super-resolution  algorithms  that  use  face-specific  constraints  for  regularization. 

All  these  systems  propose  super-resolution  as  a  separate  preprocessing  block  in  front  of  a  face  recog¬ 
nition  system.  In  other  words,  their  main  goal  is  to  construct  a  high-resolution,  visually  improved 
face  image  that  can  later  be  passed  to  a  face  recognition  system  for  improved  performance.  This 
is  perfectly  valid  as  long  as  computational  complexity  is  not  an  issue.  However,  in  a  real-time 
surveillance  scenario  where  the  super-resolution  algorithm  is  expected  to  work  on  continuous  video 
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Figure  18:  Low  resolution  four- bit  Money  image  (one  of  eight). 


streams,  computational  complexity  is  usually  a  very  critical  issue.  In  this  paper,  we  propose  an  effi¬ 
cient  super-resolution  method  for  face  recognition  that  embeds  the  super-resolution  reconstruction 
into  the  face  recognition  system.  This  is  based  on  the  observation  that  nearly  all  state-of-the-art 
face  recognition  systems  use  some  kind  of  front  end  dimensionality  reduction,  and  that  a  lot  of 
detailed  information  generated  by  a  preprocessing  type  super-resolution  algorithm  is  not  used  by 
the  face  recognition  block.  Hence,  we  propose  to  embed  the  super-resolution  reconstruction  into  the 
low-dimensional  framework  of  the  face  recognition  system  so  that  only  the  necessary  information  is 
reconstructed  without  any  unnecessary  overhead.  In  addition  to  the  computational  complexity  re¬ 
duction,  we  also  derive  face-specific  constraints  for  the  low-dimensional  framework  and  demonstrate 
how  they  improve  the  performance. 

Currently,  by  far  the  most  popular  dimensionality  reduction  technique  in  face  recognition  is  to  use 
subspace  projections  based  on  the  Karhunen-Loeve  Transform  (KLT).  This  type  of  dimensionality 
reduction  has  been  central  to  the  development  of  face  recognition  algorithms  for  the  last  ten  years. 
We  propose  to  use  a  similar  KLT-based  dimensionality  reduction  technique  to  decrease  the  com¬ 
putational  cost  of  the  super-resolution  algorithm  by  transforming  it  from  a  problem  in  the  pixel 
domain  to  a  problem  in  the  lower-dimensional  face  subspace. 

There  are  two  important  sources  of  noise  in  this  problem.  One  is  the  observation  noise  that  results 
from  the  imaging  system.  The  other  is  the  subspace  representation  error,  which  is  a  result  of  the 
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Figure  19:  Reconstruction  from  the  four-bit  images  shown  in  Figure  18. 


dimensionality  reduction.  We  derive  the  statistics  of  these  noise  processes  for  the  low- dimensional 
subspace  by  using  the  examples  from  the  human  face  image  class.  Substitution  of  this  model-based 
information  into  the  algorithm  provides  a  higher  robustness  to  noise.  We  test  our  system  on  both 
real  and  synthetic  video  sequences. 

In  Section  2,  we  briefly  review  the  KLT-based  dimensionality  reduction  method  for  face  recognition. 
Then,  in  Section  3,  we  formulate  the  super-resolution  problem  in  the  low-dimensional  framework. 
Section  4  details  the  reconstruction  algorithm,  and  Section  5  provides  experimental  results.  Con¬ 
clusions  are  given  in  Section  6. 


4.5.1  Dimensionality  reduction  for  face  recognition 


Mathematically,  the  eigenface  method  tries  to  represent  a  face  image  as  a  linear  combination  of 
orthonormal  vectors,  called  eigenfaces.  These  eigenfaces  are  obtained  by  finding  the  eigenvectors 
of  the  covariance  matrix  of  the  training  face  image  set  [68].  The  eigenvectors  corresponding  to  the 
largest  L  eigenvalues  span  a  linear  subspace  that  can  reconstruct  the  face  images  with  minimum 
reconstruction  error  in  the  least  squares  sense.  This  L-dimensional  subspace  is  called  the  face 
space.  Assuming  x  is  the  lexicographically  ordered  face  image  and  is  the  matrix  that  contains 


28 


Figure  20:  Low  resolution  three- bit  Money  image  (one  of  eight). 


the  eigenfaces  as  its  columns,  we  can  write: 

x  —  4>a  +  ex  (31) 

where  a  is  the  Lx  1  feature  vector  that  represents  the  face,  and  ex  is  the  subspace  representation 
error  for  the  face  image. 


4.5.2  Super-resolution  in  the  face  subspace 

In  this  section,  we  formulate  the  super-resolution  problem  in  the  low-dimensional  face  subspace. 
In  such  a  formulation,  the  observations  are  inaccurate  feature  vectors  of  a  subject,  and  the  re¬ 
construction  algorithm  estimates  the  true  feature  vector.  We  start  with  the  observation  model 
for  pixel-domain  super-resolution,  and  then  derive  the  observation  model  for  face-space  super¬ 
resolution  using  the  eigenface  representation.  In  pixel-domain  super-resolution,  the  observations 
are  low-resolution  images  that  are  related  to  a  high-resolution  image  with  a  linear  mapping.  By 
ordering  images  lexicographically,  such  a  relation  can  be  written  in  matrix-vector  notation  as  fol¬ 
lows: 

y(*)  =  H(*)X  +  n(*),  for  i  =  (32) 

where  x  is  the  unknown  high-resolution  image,  yW  is  the  ith  low-resolution  image  observation, 
is  a  linear  operator  that  incorporates  the  motion,  blurring,  and  downsampling  processes,  is  the 
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Figure  21:  Reconstruction  from  the  three-bit  images  shown  in  Figure  20. 


noise  vector,  and  M  is  the  number  of  observations.  Assuming  that  s  is  the  downsampling  factor 
(0  <  s  <  1),  and  that  the  high-resolution  image  is  of  dimension  N  x  N-,  yW,  H^,  x,  and  have 
dimensions  s2N 2  x  1,  s2N2  x  N2,  N2  x  1,  and  IV2  x  1,  respectively.  Details  of  such  modeling  can 
be  found  in  [21,  42],  and  we  will  not  elaborate  on  it  in  this  paper.  The  images  x  and  yW  have  two 
components  that  are  in  and  orthogonal  to  the  face  space.  Only  the  components  that  lie  in  the  face 
space  are  necessary  in  recognition.  We  will  now  derive  the  observation  model  for  the  reconstruction 
of  the  components  that  lie  in  the  face  space.  The  formulation  and  reconstruction  algorithm  will 
not  neglect  the  spatial-domain  observation  noise  and  the  subspace  representation  error,  which  is 
initially  orthogonal  to  the  face  space  but  then  becomes  effective  during  the  imaging  process.  We 
start  with  writing  the  face  space  representation: 

x  =  $a  +  ex,  (33) 

y W  =  +  ey\  for  i  =  1, . . . ,  M  (34) 

where  #  and  are  N 2  x  L  and  s2N2  x  L  matrices  that  contain  eigenfaces  in  their  columns,  aW  is 
the  Lx  1  feature  vector  that  is  associated  with  the  ith  observation,  and  ex  and  are  the  N2  x  1 
and  s2N2  x  1  representation  error  vectors.  Note  that  we  have  two  different  eigenvector  bases,  $ 
and  \F,  corresponding  to  high  and  low  resolution  face  images,  respectively.  (If  we  had  included  an 
upsampling  matrix  in  H^,  then  we  could  use  the  same  basis  matrix.) 
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Figure  22:  Reconstruction  from  eight  compressed  four-bit  images. 


We  substitute  Equations  33  and  34  into  Equation  32  to  obtain 

*a(i>  +  =  Hw  $a  +  H(i)ex  +  n(i) .  (35) 

Now,  we  will  project  Equation  35  into  the  lower-dimensional  face  space  using  the  fact  that  the 
representations  errors  are  orthogonal  to  the  face  space  'F.  Since 

=  0,  for  i  —  (36) 

and 

^fTiSf  =  I,  (37) 

by  multiplying  both  sides  of  Equation  35  by  on  the  left,  we  obtain: 

a(i)  =  ^TH^$a  +  ^TH^ex  +  9Tn®.  (38) 

This  is  the  observation  equation  that  is  analogous  to  Equation  32.  It  gives  the  relation  between 
the  unknown  “true”  feature  vector  a  and  the  observed  “inaccurate”  feature  vectors  a^.  In  the 
traditional  way  of  applying  super-resolution,  the  unknown  high-resolution  image  x  in  Equation  32 
is  reconstructed  from  the  low-resolution  observations  y 1 W .  Then,  the  reconstructed  x  is  fed  into  a 
face  recognition  system.  (See  Figure  28.)  For  eigenface-based  face  recognition  systems,  a  better 
way  is  to  directly  reconstruct  the  low-dimensional  feature  vector.  Using  the  relation  provided  in 
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Figure  23:  Reconstruction  from  eight  compressed  three-bit  images. 


Equation  38,  accurate  feature  vectors  of  a  face  image  can  be  obtained  from  its  inaccurate  feature 
vector  observations.  This  is  illustrated  in  Figure  29.  The  face  observations  yW  are  first  projected 
to  the  face  space,  and  the  computationally  intensive  super-resolution  reconstruction  is  performed 
in  the  low-dimensional  face  subspace  instead  of  the  spatial  domain.  While  we  are  reconstructing 
the  feature  vectors  in  the  low-dimensional  subspace,  we  will  substitute  face  specific  information 
in  the  form  of  statistics  of  the  prior  distributions  of  the  feature  vectors  and  distributions  of  the 
noise  processes.  Since  all  of  these  information  is  transformed  to  the  low-dimensional  face  space, 
the  computational  complexity  is  kept  low  with  little  or  no  sacrifice  of  the  performance.  Also,  using 
model-based  information  in  regularization  helps  to  obtain  more  robust  results  when  compared  to 
the  traditional  super-resolution  algorithms. 


4.5.3  Reconstruction  algorithm 

In  this  section  we  present  a  reconstruction  algorithm  to  solve  Equation  38  based  on  Bayesian 
estimation.  The  algorithm  handles  the  observation  noise  and  subspace  representation  error  in  the 
low-dimensional  face  subspace.  The  maximum  a  posteriori  probability  (MAP)  estimator  a  is  the 
argument  that  maximizes  the  product  of  the  conditional  probability  ,a(M)|a)  and  the 
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Figure  24:  Low  resolution  two-bit  Money  image  (one  of  eight), 
prior  probability  p{ a): 


.  =  arg  max  |p(a^,  •  •  •  ,  sSM )  |a)p(a 


We  now  need  to  model  the  statistics  ^(a^i  •  •  •  ,  a(M)|a)  and  p(a).  The  prior  probability  p(a)  can 
simply  be  assumed  as  jointly  Gaussian: 


P(a)  =  \  exp  (-  (a  -  p a)T  A  1  (a  -  ^ua)) 


=  A/"(/Xa,A), 

where  A  is  the  L  x  L  covariance  matrix,  pa  is  the  Lx  1  mean  of  a,  and  Z  is  a  normalization 
constant.  We  also  introduced  the  notation  Af  (■,  •)  for  normal  distribution  to  simplify  the  notation 
for  the  rest  of  the  paper. 

In  order  to  find  pieS1*,-  •  •  ,  aSM^  |a),  we  first  model  the  noise  process  in  the  spatial  domain,  and 
then  derive  the  statistics  in  face  space.  We  define  a  total  noise  term  vW  that  consists  of  the  noises 
resulting  from  the  subspace  representation  error  ex  and  the  observation  noise  nW  in  the  spatial 
domain: 

VC0  A  Hwex  +  n«.  (41) 
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Figure  25:  Reconstruction  from  the  two-bit  images  shown  in  Figure  24. 


Using  this  definition,  we  rewrite  Equation  38  for  convenience: 

a w  =  *THw$a  +  «rvw. 


(42) 


The  reason  we  defined  H^ex  +  as  the  total  noise  term  instead  of  its  projection  onto  the  face 
subspace  is  because  of  the  modeling  convenience  in  the  spatial  domain.  It  has  been  demonstrated 
that  modeling  the  noise  (resulting  from  the  imaging  system  and  the  estimation  of  H^)  in  the 
spatial  domain  as  independent  identically  distributed  (IID)  Gaussian  random  vectors  is  a  good 
assumption  [21].  We  further  assume  that  the  covariance  matrix  of  these  random  vectors  is  diagonal 
so  that  the  statistical  parameters  can  be  estimated  easily  even  with  the  limited  training  data.  Using 
these  assumptions,  it  is  easy  to  find  the  distribution  of  \E,7V*)  in  the  face  space,  as  will  be  shown 
shortly. 

Defining  K  as  the  s2IV2  x  s2iV2  positive  definite  diagonal  covariance  matrix  and  //,[/  as  the  s2IV2  x  1 
mean  of  vW,  we  can  write  the  probability  distribution  of  as: 

p(v«)  =Af(74i),K)  .  (43) 


Now,  we  need  to  derive  the  distribution  of  the  projected  noise,  p(<i>TvW),  in  order  to  get  the  con¬ 
ditional  PDF  p( a^1),  •  •  •  ,  a^M)|a).  From  the  analysis  of  functions  of  multi- variate  random  variables, 
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Figure  26:  Low  resolution  one-bit  Money  image  (one  of  eight). 


it  follows  that  p(\PrvW)  is  also  jointly  Gaussian  since  is  nonsingular  (by  construction).  As 

a  result,  we  have: 

p(^Tvw )  =  M  ,  Q)  ,  (44) 

where  is  the  new  mean,  and  Q  is  the  new  covariance  matrix  computed  by  Q  =  4’rKvI'. 

The  covariance  matrix  Q  has  dimension  of  L  x  L  while  K  is  of  dimension  s2N2  x  s2N2.  Using 
Equations  42  and  44,  we  find  the  conditional  PDF  p(aW|a): 

p(&w  |  a)  =  J\f  Q)  •  (45) 


Since  we  assumed  that  vW’s  are  IID,  it  follows  that  the  probability  density  function  p(sSl\  •  •  •  ,  |a) 

is  the  product  of  p(aW  |a)  for  i  =  1,  •  •  •  ,  M.  Defining  =  a^  -  we  write: 

p(a(1) ,  •  •  ■  ,  a(M)  |a)  =  ^  exp  ^  £(i)TQ_1£(i)^  }  (46) 

where  Z  is  a  normalization  constant. 

Substituting  the  conditional  and  prior  PDFs  given  in  Equations  40  and  46  into  Equation  39  ,  we 


Figure  27:  Reconstruction  from  the  one-bit  images  shown  in  Figure  26. 
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Figure  28:  Super-resolution  applied  as  a  preprocessing  block  to  face  recognition 


(47) 


obtain  the  MAP  estimate  a  as  follows: 

{M 

J2  [^(i)TQ_1^(i)]  +  (a  -  ^a)T  A-1  (a  -  Ha) 
i=  1 

This  solution  can  be  obtained  easily  using  the  iterative  steepest  descent  algorithm. 

In  the  reconstruction,  everything  but  a,  aW,  and  HW  is  known  and  can  be  computed  in  advance. 
For  a  specific  observation  sequence  yW,  the  feature  vectors  aW  and  the  blur  mappings  are 
computed,  and  the  true  feature  vector  a  is  estimated  iteratively.  Each  iteration  in  such  algorithm 
requires  a  number  of  operations  that  is  directly  proportional  to  the  size  of  the  vector  to  be  recon¬ 
structed.  If  super-resolution  is  applied  as  a  preprocessing  block,  then  the  number  of  operations 
will  be  0(N2),  N2  being  the  number  of  pixels  in  the  face  image.  For  the  face-space  reconstruction 
algorithm,  the  number  of  operations  reduces  to  O(L).  Therefore,  face-space  super-resolution  pro¬ 
vides  an  efficiency  gain  proportional  to  N2 /L  in  computation  over  pixel-domain  super-resolution. 
(Typically  face  images  are  of  size  60  x  60  and  a  face  dimension  of  L  —  50  is  satisfactory,  in  which 
case  the  face-space  super-resolution  is  approximately  72  times  faster  than  the  pixel-domain  super¬ 
resolution.) 


4.5.4  Experimental  results 


We  performed  a  set  of  experiments  to  demonstrate  the  efficacy  of  the  proposed  method.  We 
investigated  the  effect  of  the  face  space  dimension,  and  sensitivity  to  noise  and  motion  estimation 
errors. 

In  these  experiments,  we  used  face  images  from  the  Yale  face  databases  A  and  B  [24],  Harvard 
Robotics  Laboratory  database  [31],  and  AR  database  [44].  The  images  are  downsampled  to  have  a 
size  of  40  x  40,  and  aligned  according  to  the  manually  located  eye  and  mouth  locations.  We  selected 
134  images  as  training  data  and  50  images  as  test  data.  We  applied  KLT  to  those  134  images  and 
chose  the  first  60  eigenvectors  having  the  largest  eigenvalues  to  form  the  face  subspace.  (These 
60  eigenvectors  form  the  columns  of  the  matrix  4>.)  We  also  downsampled  the  training  images  by 
four  to  obtain  10  x  10  images,  applied  the  KLT  to  those  images,  and  chose  the  first  60  of  them  to 
construct  the  eigenface  space  4>. 

The  test  images  were  jittered  by  a  random  amount  to  simulate  motion,  blurred,  and  downsampled 
by  a  factor  of  four  to  produce  multiple  low-resolution  images  for  each  subject.  The  motion  vectors 
were  saved  for  use  in  synthetic  video  experiments.  For  blurring,  the  images  were  convolved  with  a 
point  spread  function  (PSF),  which  was  set  to  a  5  x  5  normalized  Gaussian  kernel  with  zero  mean 
and  a  standard  deviation  of  one  pixel. 

From  the  training  image  set,  (K  =  134),  we  estimate  the  statistics  of  a  and  v^.  The  unbiased 
estimates  for  the  mean  and  covariance  matrix  of  a  are  simply  obtained  from  the  sample  mean  and 
variances. 

One  of  the  frames  for  each  video  sequence  is  chosen  as  the  reference  frame,  bilinearly  interpolated  by 
four,  and  projected  onto  the  face  space  $  to  obtain  the  initial  estimate  for  the  true  feature  vector. 
It  is  then  updated  using  the  steepest  descent  technique.  The  mapping  is  computed  from  the 
known  motion  vectors  and  PSF,  and  16  low-resolution  images  are  used  in  the  reconstruction. 


37 


We  also  wanted  to  compare  the  results  of  this  eigenface-domain  super-resolution  algorithm  with 
those  of  traditional  pixel-domain  super-resolution.  We  applied  the  pixel-domain  super-resolution  al¬ 
gorithm  given  in  [42]  to  the  low-resolution  video  sequences  again  using  the  same  16  low-resolution 
images  and  setting  the  number  iterations  to  seven.  After  the  high-resolution  images  are  recon¬ 
structed,  they  are  projected  onto  the  face  space  to  obtain  the  feature  vectors. 

The  feature  vectors  obtained  from  these  algorithms  are  compared  with  the  true  feature  vectors 
(which  are  computed  using  the  40  x  40  original  high-resolution  images).  Figure  30  shows  the 
results  for  the  normalized  distance  between  the  true  feature  vector  a  and  the  estimated  feature 
vector  a.  Figure  31  provides  an  example  from  the  face  database.  It  is  seen  that  eigenface-domain 
super-resolution  performs  as  well  as  the  pixel-domain  super-resolution  at  a  lower  computational 
complexity. 


4.5.5  Conclusions 

In  this  paper,  we  propose  to  apply  super-resolution  after  dimensionality  reduction  in  a  face  recogni¬ 
tion  system.  In  this  way,  only  the  necessary  information  for  recognition  is  reconstructed.  We  have 
also  shown  how  to  incorporate  the  model-based  information  into  the  face-space  reconstruction  algo¬ 
rithm.  This  helps  to  obtain  more  robust  results  when  compared  to  the  traditional  super-resolution 
algorithms.  We  investigate  the  effects  the  effects  of  the  feature  vector  length  (i.e.,  dimension  of  the 
face  space),  noise,  and  motion  estimation  error  on  the  performance.  The  detailed  results  will  be 
pn 


Figure  30:  Error  in  feature  vector  computation. 
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Figure  31:  (a)  Original  40  x  40  image,  (b)  10  x  10  low-resolution  observation  is  interpolated  using 
nearest  neighbor  interpolation,  (c)  10  x  10  low-resolution  observation  is  interpolated  using  bilinear 
interpolation,  (d)  Pixel-domain  super-resolution  applied,  (e)  The  result  of  pixel-domain  super¬ 
resolution  reconstruction  is  projected  into  the  face  subspace,  (f)  Representation  of  the  feature 
vector  reconstructed  using  the  eigenface-domain  super-resolution  in  the  face  subspace. 

4.6  Effects  of  Camera  Response  Function  and  Illumination  Changes  in  Multi- 
Frame  Image  Reconstruction 

An  image  is  a  two-dimensional  projection  of  a  real- valued  scene  that  is  a  function 

of  time  spectral  wavelength  A,  and  space  (x,y,z).  During  an  imaging  process,  the  quantity 
f(t\  A;  x,  y,  z)  is  degraded  in  several  ways.  The  degradation  may  be  on  the  domain  (£;  A;  x,  y,  z)  as 
well  as  on  the  range  /.  By  combining  multiple  measurements,  image  fusion  algorithms  produce  a 
composite  (or  a  set  of  composites)  that  has  more  information  than  any  of  the  individual  measure¬ 
ments  does.  (Image  fusion  is  sometimes  referred  to  as  displaying  images  on  top  of  each  other  to 
help  decision  making.  For  instance,  in  surveillance  systems,  visible  and  infra-red  (IR)  data  are  used 
to  create  a  composite  where  people  only  visible  in  the  IR  imagery  are  combined  with  the  context  of 
the  terrain  from  visible  image.  Here,  we  refer  to  image  fusion  as  reconstruction  of  original  scene.) 
With  image  fusion,  we  can  achieve  four  types  of  improvement:  temporal,  spectral,  space,  and  range. 

A.  Temporal  Improvement:  Frames  of  a  video  sequence  are  captured  at  a  certain  frame  rate,  and 
events  that  occur  faster  than  the  frame  rate  are  not  visible  or  else  observed  incorrectly.  This  is 
known  as  motion  aliasing.  Also,  due  to  the  nonzero  exposure,  fast  motion  causes  the  so-called 
motion  blur,  integration  of  light  on  the  sensor  array  over  time.  It  is  possible  to  catch  events  that 
are  faster  than  the  frame  rate  when  the  motion  can  be  modeled  [51]  and/or  when  there  are  multiple 
cameras  capturing  the  scene  with  a  time  offset  or  at  a  different  frame  rate  [56] . 

B.  Spectral  Improvement:  Spectral  filters  are  used  to  sample  a  certain  portion  of  the  spectrum. 
When  the  purpose  is  to  produce  a  color  picture,  three  different  spectral  filters  provide  sufficient 
visual  quality  for  most  of  the  time.  However,  in  multi-spectral  and  hyper-spectral  imagery,  high 
spatial  and  spectral  resolution  is  desirable,  and  various  algorithms  fusing  images  of  different  spectral 
and  spatial  resolutions  have  been  proposed  [25,  26]. 

C.  Space  Improvement:  The  result  of  an  imaging  process  is  a  two-dimensional  data  that  is  limited 
in  spatial  extent  and  resolution.  By  combining  multiple  observations,  it  is  possible  to  achieve 
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improvement  in  space.  The  improvement  can  be  three-dimensional  (3D)  reconstruction  when  the 
observations  and  the  scene  model  provide  sufficient  information  about  the  3D  structure.  In  some 
applications,  two-dimensional  improvement  is  sought.  For  instance,  super-resolution  reconstruction 
improves  spatial  resolution  by  sub-pixel  registration  of  images  [42,  34,  16,  53,  21].  In  mosaic 
construction,  an  image  of  larger  spatial  extent  is  constructed  [61].  Color  filter  array  interpolation 
is  another  example  image  fusion  where  spatial  resolution  of  the  color  channels  are  improved  using 
the  correlation  between  red,  green,  and  blue  spectral  data  [27]. 

D.  Range  Improvement:  Although  a  real  scene  has  in  general  a  wide  dynamic  range,  imaging  devices 
are  limited  in  dynamic  range.  During  an  imaging  process,  the  dynamic  range  of  a  scene  is  degraded 
by  noise,  dynamic  range  compression,  and  quantization.  Multi-frame  filtering  algorithms  have  been 
widely  used  to  get  rid  of  the  noise.  Recent  works  by  Mann,  Robertson  et  al.,  and  Candocia  improve 
the  dynamic  range  (range  extent)  of  the  images  [43,  48,  10,  11]. 

In  this  project,  we  are  proposing  an  image  fusion  algorithm  that  improves  resolution  and  extent  in 
both  range  and  spatial  domain.  Although  there  is  a  large  amount  of  work  done  aiming  to  improve 
spatial  resolution,  none  of  these  super-resolution  algorithms  consider  improving  range.  However, 
the  observations  may  have  information  diversity  in  range  due  to  changes  of  illumination  in  the 
scene  or  imaging  device  adjustments  such  as  exposure  time,  gain,  white  balance;  and  this  can  be 
used  to  produce  composite  images  of  higher  dynamic  range  (range  extent)  and  tonal  fidelity  (range 
resolution)  in  addition  to  the  higher  spatial  resolution  and  extent. 

Next  section  presents  an  observation  model  that  establishes  the  connection  between  a  scene  and 
multiple  observations  of  that  scene  that  are  limited  in  spatial  and  range  resolution/extent.  Based  on 
this  model,  an  image  fusion  algorithm  that  achieves  improvement  in  both  range  and  spatial  domain 
is  proposed  in  Section  3.  Section  4  addresses  some  of  the  implementation  issues,  and  provides 
preliminary  results. 


4.6.1  Imaging  model 

In  this  section,  we  will  extend  a  typical  imaging  model  used  in  super-resolution  algorithms  to  include 
dynamic  range  compression  and  quantization  operations.  Super-resolution  algorithms  model  the 
imaging  process  as  a  linear  mapping  between  a  high-resolution  input  signal  /(m ,  n2)  and  low- 
resolution  observations  Zi(mi,ni2).  This  mapping  includes  motion  (of  the  camera  or  the  objects  in 
the  scene),  blur  (because  of  the  nonzero  sensor  aperture  time,  the  nonzero  physical  dimensions  of 
the  individual  sensor  elements,  the  degree  of  focus,  etc.),  and  sampling  with  a  low-resolution  grid 
[42].  According  to  this  model,  the  mapping  from  a  high-resolution  image  to  a  low  spatial-resolution 
image  is  expressed  as  a  weighted  sum  of  the  high-resolution  image  pixels,  where  the  weights  are 
the  values  of  a  space-invariant  point-spread  function  (PSF)  at  the  corresponding  pixel  locations. 
The  center  of  the  PSF  depends  upon  the  motion  between  the  high-resolution  image  and  the  low- 
resolution  images.  Motion  vectors  from  each  low  spatial-resolution  image  to  the  high-resolution 
image  determine  how  each  pixel  is  mapped.  The  normalized  PSF  that  characterizes  the  camera 
is  centered  at  that  location,  and  from  it  the  weights  on  the  high-resolution  image  grid  are  found. 
Defining  hi  (mi,  m2;  ni,  n2)  as  this  mapping,  we  can  model  the  image  capture  process  by: 

Zi(mi,m2)  =  ^2  hi  (mi ,  m2 ;  ni ,  ra2)  /  (ni ,  n2 ) ,  (48) 

ni,n2 
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Figure  32:  Imaging  model. 


where  (ni,n2)  and  (mi,  m2)  are  discrete  coordinates  of  high-  and  low-  resolution  images,  respec¬ 
tively,  and  i  is  the  observation  number.  This  model  assumes  that  the  observations  are  captured 
under  the  same  illumination  condition  and  camera  settings.  However,  this  assumption  is  not  always 
valid.  It  is  likely  that  measurements  of  the  same  scene  are  obtained  with  different  camera  settings 
(such  as  exposure  time,  gamma  factor,  offset,  etc.)  in  addition  to  potential  illumination  changes  in 
the  scene  itself. 

We  now  extend  the  model  formulated  in  (58)  to  include  processes  affecting  the  range  of  the  observa¬ 
tion.  The  extended  model  is  given  in  Figure  39.  In  addition  to  the  previous  model,  this  model  also 
includes  the  effects  of  exposure  timer,  dynamic  range  compressor,  and  quantizer.  Exposure  timer 
determines  the  amount  of  light  falling  on  the  sensor  array.  In  order  to  see  dark  regions  in  a  scene, 
exposure  time  should  be  set  long.  With  a  long  exposure  time,  light  regions  in  a  scene  get  saturated. 
(When  we  have  multiple  images  of  the  same  scene  captured  with  different  exposure  times,  we  can 
register  those  images  and  obtain  a  composite  image  of  larger  dynamic  range  [43,  48,  10,  11].)  After 
dynamic  range  is  compressed,  pixel  intensities  are  digitized.  Usually,  eight  bits  are  used  for  each 
pixel. 


Including  the  effects  acting  on  range,  we  can  extend  the  equation  given  in  (58)  as  follows: 

Zi(mi,m2)  =  Q<F  r]il  ^  hi(mi,m2-,n1,n2)f(ni,n2)  \  +  m  >  ,  (49) 

l  L  \ni,n2  J  J 

where  F[-]  is  the  dynamic  range  compression  function,  Q{-}  is  the  quantizer,  rji  and  m  are  the  illu- 
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mination  gain  and  offset,  respectively.  In  this  model,  the  illumination  gain  and  offset  are  considered 
space-invariant.  This  is  a  valid  assumption  unless  there  is  non-uniform  illumination  change  within 
the  scene.  A  typical  dynamic  range  compression  function  is  depicted  in  Figure  40.  It  is  linear  (or 
log-linear)  at  midtones,  and  becomes  flat  at  low  and  high  intensities.  The  quantizer  is  assumed 
to  have  uniform  step  sizes;  therefore,  the  joint  effect  of  dynamic  range  compression  function  and 
quantizer  is  a  quantizer  of  non-uniform  step  sizes.  The  quantization  noise  is  larger  at  low  and  high 
pixel  intensities  compared  to  the  midtones. 


Figure  33:  Bounds  of  the  input  pixel  intensity  can  be  determined  from  the  observed  pixel  intensity. 
The  bounds  are  looser  towards  the  extremes  because  of  larger  quantization  step  sizes. 


4.6.2  Joint  Spatial  and  Range  Improvement 

In  this  section,  we  present  a  new  image  fusion  algorithm  for  joint  spatial  and  range  improvement. 
We  will  define  constraint  sets  using  the  quantization  bounds  of  the  pixel  intensities,  and  employ  a 
POCS  (projections  onto  convex  sets)  based  algorithm  to  produce  an  image  of  higher  spatial  and 
range  information. 

As  formulated  in  Equation  (59),  dynamic  range  compression  and  quantization  introduces  quanti- 
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zation  error  in  the  pixel  intensities  of  the  observations.  The  quantization  error  is  error  small  at  the 
midtones,  but  large  towards  the  ends  of  the  dynamic  range.  Let  (5j(mi,m2)  be  the  quantization 
error  at  pixel  (mi,  m2)  of  ith  observation,  then  we  can  rewrite  Equation  (59)  as: 


Zi(mi,m2)  =  Vi 


hi(mi,m2;ni,n2)/(ni, 


+  Hi+  6j(rni,m2). 


(50) 


Although  the  exact  value  of  the  quantization  error  cannot  be  determined  from  the  measurements, 
the  bounds  within  which  it  lies  can  be  determined  when  the  dynamic  range  compression  response 
of  the  camera  is  known.  This  is  illustrated  in  Figure  40.  The  lower  and  upper  bounds  of  the 
quantization  error  are  closer  to  each  other  for  midtone  pixel  intensities.  The  measurements  towards 
the  ends  of  the  dynamic  range  has  wider  bounds. 

We  now  define  constraint  sets  using  the  bounds  of  range  data,  and  employ  a  projections  onto  convex 
sets  (POCS)  algorithm.  Let  T\  (q)  and  Tu  ( q )  be  the  lower  and  upper  bounds  of  quantization  for  a 
measured  pixel  intensity  q,  then  we  can  write  the  constraint  set  on  any  measurement  Zj(mi,m2)  as 
follows: 


C'[^(mi,m2)]  =  {/(rn,n2)  :  Tt  (^(mi,m2))  <  |ri(mi,m2)|  <  Tu  (^(mi,m2))} ,  (51) 

where  a:(ni,n2)  is  the  initial  estimate  of  the  original  scene  /(ni,n2),  and  rj(mi,m2)  is  the  residual 
between  the  observed  pixel  intensity  and  the  intensity  computed  from  the  initial  estimate: 


Vi 


^2  ^(mi,m2;m,n2)x(ni,n2)  J  +  /i* 


\ni,n2 


(52) 


Set-theoretic  reconstruction  techniques  produce  solutions  that  are  consistent  with  the  information 
arising  from  observed  data  or  prior  knowledge  about  the  solution.  Each  piece  of  information  is 
associated  with  a  constraint  set  in  the  solution  space,  and  the  intersection  of  these  sets  represents 
the  space  of  acceptable  solutions  [19].  When  there  are  multiple  observations,  the  intersection  of 
the  constraint  sets  may  result  in  finer  range  resolution  and  dynamic  range.  This  is  illustrated  in 
Figure  41. 


For  each  pixel  in  the  observations,  we  can  define  a  constraint  set,  and  project  an  initial  estimate 
onto  these  constraint  set  to  improve  its  resolution  and  extent  in  range  and  spatial  domain.  After 
some  simple  algebra,  the  projection  operator  onto  a  constraint  set  C  [^(rai,ra2)]  can  be  found  as: 


■^C[zi(m  1,1712)]  ^2)]  xijl\ ,  7l2)  + 


m 


,n(mi,m2)  >  Tu  (Zi(mi,m2)) 

L  \hi{rni,m2]niyn2)\  I 

nl  >n2  / 

0,T/ (zi(mi,m2))  <  |ri(mi,m2)|  <  Tu  (zj(mi,m2)) 

1  I  (n^l.^)-rt(^(rn1,rn2)))hi(m1,m2;n1,n2)  .  ,  ri(mij  m2)  <  TX  m2)) 


Vi 


E  \hi{mi,m2;ni,n2)\ 

nl  >n2 


>  . 


(53) 


In  [42],  a  projections  onto  convex  sets  (POCS)  algorithm  is  presented.  In  that  algorithm,  bounds 
of  the  constraint  sets  are  set  to  a  fixed  number,  which  is  chosen  heuristically.  In  this  paper,  we 
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show  that  those  bounds  should  actually  be  a  function  of  observed  pixel  intensity,  and  should  be 
chosen  using  the  camera  response  function. 

In  the  next  section,  we  give  the  details  of  estimating  registration  parameters,  obtaining  initial 
estimate,  and  the  complete  algorithm. 


resolution  from  POCS 


Extended  dynamic 
range 

Figure  34:  Multiple  observations  lead  to  higher  resolution  and  wider  dynamic  range. 


4.6.3  Estimating  camera  response  function,  motion  and  illumination  parameters 


There  are  various  ways  of  estimating  camera  response  function,  and  motion  and  illumination  pa¬ 
rameters  [43,  48,  10,  11].  Although  all  parameters  can  be  estimated  jointly,  it  is  more  reliable 
to  estimate  camera  response  function  as  a  preprocessing  step,  and  then  use  it  in  determination 
of  motion  and  illumination  parameters.  Our  initial  approach  is  to  linearize  the  camera  response 
function  and  determine  its  parameters  [10].  For  the  motion  and  illumination  estimation,  we  use  a 
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parametric  motion  model,  and  determine  its  parameters  jointly  with  the  illumination  parameters. 
We  now  outline  this  approach  for  affine  motion  model;  it  can  be  extended  to  other  models  easily. 
Given  the  affine  model  parameters  [a\  a 2  as  a 4  a&],  the  motion  vectors  are: 


u(x,  y )  =  a\x  +  a2y  +  a3  ^ 

v(x,y)  =a4x  +  a5y  +  ae  ' 

Incorporating  these  equations  into  the  optic  flow  equation  with  illumination  model,  we  get: 

r)Ii(x,  y)  +  y  =  I2(x  +  a\x  +  a2y  +  a3,  y  +  a4x  +  a5y  +  ae),  (55) 

where  Ii(x,y)  is  the  image  intensity  at  pixel  (x, y)  in  the  first  frame,  I2{x  +  a\x  +  a2y +a3,y  +  a^x  + 
d5y  +  ae)  is  the  corresponding  pixel  intensity  in  the  second  frame,  and  77  and  y  are  the  illumination 
terms. 

Applying  Taylor  series  expansion  to  Equation  (55)  yields: 

Ix{x,y)(aix  +  a2y  +  03)  +  Iy{x,y){a^x  +  a5y  +  o6)  +  I2(x,y)  -  r]h(x,y)  -  y  =  0.  (56) 


We  now  define  the  cost  function 

^  =  f  Ix{x,y){aix  +  a2y  +  a3)  +  Iy(x,y)(ci4X  +  any  +  a6)+ 

“  I  h(x,y)  -yh(x,y)  -  y 
x<y 

and  by  taking  the  partial  derivatives  of  \l/  with  respect  to  affine  and  illumination  parameters,  and 
setting  them  to  zero,  we  end  up  with  a  linear  set  of  equations,  which  can  be  solved  easily. 


4.6.4  Experimental  results 

We  provide  the  results  of  an  experiment  to  demonstrate  the  usefulness  of  the  proposed  method. 
We  captured  a  video  sequence  using  a  Sony  DCR-TRV20  digital  camcorder.  While  capturing 
the  sequence,  we  increased  the  exposure  time  manually.  The  images  from  the  video  sequence  are 
given  in  Figures  35  to  37.  The  image  in  35  is  captured  with  a  short  exposure  time;  the  buildings 
outside  the  window  can  be  seen  clearly.  On  the  other  hand,  the  image  in  37  is  captured  with  a 
longer  exposure  time;  the  inside  can  be  seen  clearly,  but  outside  the  window  cannot  be  seen  due  to 
saturation.  We  applied  the  proposed  algorithm  to  these  images.  The  reconstructed  image  is  given 
in  Figure  38.  It  has  higher  dynamic  range  than  any  of  the  observations  does  alone.  Its  dynamic 
range  is  from  —22  to  227,  which  is  scaled  with  respect  to  the  first  observation.  (The  pixel  intensity 
range  is  given  as  a  side  bar.) 
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Figure  35:  First  observation. 


Figure  36:  Second  observation. 


Figure  37:  Third  observation. 
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4.7  Multi-Frame  Information  Fusion  for  Gray-Scale  and  Spatial  Enhancement 
of  Images 

An  image  is  a  two-dimensional  projection  of  a  real- valued  scene  q(t]  A;  x,  y,  z )  that  is  a  function 
of  time  t,  spectral  wavelength  A,  and  space  (x.  y,  z).  During  the  imaging  process,  the  quantity 
q(t;  A;  x,  y,  z )  is  degraded  in  several  ways.  The  degradation  may  occur  on  the  domain  (<;  A;  x,  y,  z) 
and/or  on  the  range  q.  By  combining  multiple  measurements,  image  fusion  algorithms  produce  a 
composite  (or  a  set  of  composites)  that  has  more  information  than  any  of  the  individual  measure¬ 
ments.  With  image  fusion,  we  can  improve  resolution  and/or  extent  in  the  four  domains:  temporal, 
spectral,  spatial,  and  gray-scale. 

A.  Temporal  Improvement:  Frames  of  a  video  sequence  are  captured  at  a  certain  frame  rate,  and 
events  that  occur  faster  than  the  frame  rate  are  not  visible  or  else  are  observed  incorrectly.  It  is 
possible  to  catch  events  that  are  faster  than  the  frame  rate  when  the  motion  can  be  modeled  [51] 
and/or  when  there  are  multiple  cameras  that  capture  the  scene  with  different  time  offsets  or  frame 
rates  [56]. 

B.  Spectral  Improvement:  Spectral  filters  are  used  to  sample  a  certain  portion  of  the  spectrum. 
When  the  purpose  is  to  produce  a  color  picture,  three  different  spectral  filters  usually  provide 
sufficient  visual  quality  for  human  viewing.  However,  in  multi-spectral  and  hyper-spectral  imagery, 
high  spatial  and  spectral  resolution  is  desirable  [26]. 

C.  Spatial  Improvement:  The  result  of  an  imaging  process  is  two-dimensional  data  that  is  limited 
in  its  spatial  extent  and  resolution.  By  combining  multiple  observations,  it  is  possible  to  improve 
the  spatial  information  content.  The  reconstruction  algorithm  can  seek  three-dimensional  or  two- 
dimensional  improvement.  For  instance,  super-resolution  reconstruction  improves  (2D)  spatial 
resolution  by  sub- pixel  registration  of  images  [42,  53]. 

D.  Gray-Scale  Improvement:  Image  sensors  produce  continuously  varying  voltages  that  are  propor¬ 
tional  to  the  amount  of  light  falling  on  them.  Because  of  the  limited  dynamic  range  of  the  sensors, 
only  a  portion  of  the  real  dynamic  range  is  captured.  This  is  not  the  only  source  of  information 
loss  in  the  gray-scale  domain,  however.  In  order  to  process  this  data  in  the  digital  domain,  the 
continuum  of  gray-scale  values  are  quantized  into  a  set  of  discrete  values.  The  number  of  distinct 
gray  levels  is  kept  small  for  data  storage  and  processing  efficiency.  As  a  result,  a  digital  image  is 
limited  in  both  its  gray-scale  extent  and  its  gray-scale  resolution. 

Recent  work  by  Mann  [43],  Robertson  et  al.  [48],  and  Candocia  [10]  demonstrate  how  to  improve 
the  dynamic  range  (gray-scale  extent)  of  the  image  by  combining  images  captured  with  different 
exposure  times.  Mann  registered  the  images  in  range  as  a  weighted  sum  of  the  pixel  intensities. 
The  weights  were  determined  as  a  measure  of  reliability  of  the  observed  data.  In  Robertson  [48] 
and  Candocia  [10]  works,  maximum  likelihood  and  least  squares  estimation  are  used,  respectively. 

In  this  project,  we  propose  a  set-theoretic  image  fusion  algorithm  that  improves  resolution  and 
extent  in  both  the  gray-scale  (range)  and  spatial  domains.  Although  considerable  work  has  been 
done  to  improve  the  spatial  resolution,  none  of  these  super-resolution  algorithms  consider  improv¬ 
ing  the  range  (gray-scale  extent  and  resolution).  However,  the  observations  may  provide  diverse 
information  about  range  due  to  changes  of  illumination  in  the  scene  or  imaging  device  adjustments 
such  as  exposure  time,  gain,  or  white  balance,  which  can  be  used  to  produce  composite  images  of 
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higher  dynamic  range  (gray-scale  extent)  and  tonal  fidelity  (gray-scale  resolution)  in  addition  to 
higher  spatial  resolution  and  extent.  The  proposed  algorithm  can  be  considered  as  a  generalization 
of  super-resolution  algorithms. 


Gain  Offset  Saturation  Digitization 
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Figure  39:  Imaging  model  includes  effects  acting  on  the  gray-scale  domain  as  well  as  on  the  spatial 
domain. 


4.7.1  Imaging  Model 

In  this  section,  we  use  a  more  general  image  acquisition  model  (than  the  ones  used  in  current 
super-resolution  algorithms)  that  includes  the  factors  that  act  in  the  gray-scale  domain  (such  as 
the  exposure  time,  white  balance  adjustment,  saturation)  in  addition  to  the  factors  that  act  in 
the  spatial  domain  (such  as  the  point  spread  function  of  the  sensors  and  sampling).  The  overall 
model  is  illustrated  in  Figure  39.  Light  coming  from  the  spectrally  and  spatially  varying  scene 
q(t-,  A;  x,  y,  z )  is  passed  through  an  optical  system  to  form  a  two-dimensional  real-valued  function 
/( x,  y )  on  a  sensor  array.  It  is  assumed  that  the  scene  is  static  (time-invariant)  during  the  exposure 
time.  A  spectral  filter  captures  a  certain  portion  of  the  light  spectrum.  The  sensors  may  have  a 
nonlinear  response  to  the  amount  of  light  falling  on  the  sensor  surface.  This  nonlinearity  is  usually 
by  an  exponential  function,  which  is  known  as  the  gamma  factor.  Most  commercial  cameras  have 
a  built-in  gamma  correction  circuit  that  linearizes  the  relationship  between  the  impinging  light 
intensity  and  the  image  output  level.  The  rest  of  the  imaging  system  converts  the  signal  f(x,y), 
which  is  continuous  in  the  spatial  and  gray-scale  domains,  into  a  digital  image.  The  process  can 
be  separated  into  two  parts.  The  first  part  is  a  sensor  array  that  converts  the  continuous  signal 
into  a  discrete  array  using  its  discrete  sensor  elements.  Each  sensor  element  is  an  analog  device 
with  nonzero  physical  dimensions  and  produces  a  real-valued  pixel  intensity.  The  second  part  of 
the  process  acts  on  these  pixel  intensities.  The  camera  may  adjust  the  exposure  time  and  white 
balance  (automatically  or  manually  by  the  user) ,  which  affect  the  pixel  intensities  as  gain  and  offset 
terms,  respectively.  The  pixel  intensities  are  limited  to  a  certain  dynamic  range  and  are  quantized 
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to  a  certain  number  of  bits.  The  result  is  a  digital  image  z (mi,  m2). 

Current  super-resolution  algorithms  try  to  reconstruct  the  signal  f(x,y )  without  modeling  the 
processes  acting  on  the  gray-scale  domain  of  the  images.  In  practice,  the  signal  is  reconstructed  on 
a  discrete  grid  (?m  ,  77,2)  instead  of  the  continuous  coordinates  ( x ,  y),  that  has  a  finer  resolution  than 
the  (mi,  m2)  grid.  Incorporating  the  motion  between  observations,  the  super-resolution  algorithms 
model  the  imaging  process  as  a  linear  mapping  between  a  high-resolution  input  signal  /(ni,n2) 
and  low- resolution  observations  -2; (mi,  m2).  This  mapping  includes  motion  (of  the  camera  or  the 
objects  in  the  scene)  and  blur  (caused  by  the  point  spread  function  of  the  sensor  elements  and  the 
optical  system);  it  can  be  formulated  as 

2i(mi,m2)=  ^2  hi(m1,m2;ni,n2)/(ni,n2),  (58) 

ni,n2 

where  (ni,n2)  and  (mi,  m2)  are  the  discrete  coordinates  of  the  high-  and  low-resolution  images, 
respectively,  i  is  the  observation  number,  and  hi(mi,m2; 711,712)  is  the  linear  mapping  that  incor¬ 
porates  motion,  point  spread  function  (PSF),  and  downsampling.  Details  of  such  modeling  can  be 
found  in  [42,  53]. 

The  model  formulated  in  (58)  assumes  that  the  observations  are  captured  under  the  same  illumi¬ 
nation  conditions  and  camera  settings.  However,  this  assumption  is  not  always  valid.  Including 
the  effects  acting  on  range  (such  as  exposure  time,  gain,  offset,  etc.),  we  can  extend  the  equation 
given  in  (58)  as  follows: 


Zi(mi,m2)  =  Q 


Vi 


hj(mi,  m2;  77i,  7i2)/(ni,  7i2) 


(59) 


where  F[]  is  the  dynamic  range  saturation  function,  Q{}  is  the  gray-scale  quantizer,  and  ry  and  /q 
are  the  illumination  gain  and  offset,  respectively.  In  this  model,  the  illumination  gain  and  offset  are 
assumed  to  be  spatially  uniform.  This  is  a  valid  assumption  unless  there  is  non-uniform  illumination 
change  within  the  scene.  For  the  case  of  non-uniform  illumination  effects,  the  parameters  ry  and 
/iz  can  be  modeled  as  functions  of  the  spatial  coordinates.  A  typical  dynamic  range  saturation 
function  F[-}  is  depicted  in  Figure  40.  Because  of  the  nonlinearity  of  the  saturation  function,  the 
quantization  noise  is  larger  at  low  and  high  pixel  intensities  than  at  the  midrange. 


4.7.2  Joint  Gray-Scale  and  Spatial  Domain  Enhancement 

In  this  section  we  present  a  new  image  fusion  algorithm  for  joint  gray-scale  and  spatial  enhancement. 
We  define  constraint  sets  using  the  quantization  bounds  of  the  pixel  intensities,  and  employ  a 
projections  onto  convex  sets  (POCS)-based  algorithm  to  produce  an  image  of  higher  spatial  and 
gray-scale  information  content. 

Constraint  Sets  from  Amplitude  Quantization 

As  formulated  in  (59) ,  dynamic  range  compression  and  quantization  introduces  quantization  error 
in  the  pixel  intensities  of  the  observations.  Let  ^(7711,7772)  be  the  gray-scale  quantization  error  at 


50 


Figure  40:  Pixel  intensities  are  compressed  and  digitized  observations  of  real-valued  quantities.  For 
each  intensity  level,  there  is  a  lower  bound  !}(•)  and  an  upper  bound  Tu(-)  within  which  the  input 
intensity  lies. 

pixel  (mi,  m2)  of  the  ith  observation.  Then  we  can  rewrite  (59)  as 

Zi(mi,m2)  =  rji  1  hi(mi,m2;ni,n2)/(ni,n2) 

Vni,n2 


+  Hi  +  m2). 


(60) 


Although  the  exact  value  of  the  quantization  error  cannot  be  determined  from  the  measurements, 
the  bounds  within  which  it  lies  can  be  determined  when  the  saturation  curve  of  the  camera  is 
known.  As  illustrated  in  Figure  40,  the  input  intensities  are  mapped  by  the  nonlinear  saturation 
function  and  then  quantized  to  a  certain  number  of  bits,  typically  eight.  For  each  intensity  level, 
there  is  a  lower  bound  7}(-)  and  an  upper  bound  Tu(-)  within  which  the  input  intensity  lies;  and 
these  bounds  can  be  determined  from  the  saturation  curve. 

We  now  define  constraint  sets  using  the  bounds  of  the  gray-scale  data,  and  employ  a  POCS-based 
algorithm.  In  the  POCS  technique,  each  piece  of  information  (arising  from  the  observed  data  or 
prior  knowledge)  is  associated  with  a  constraint  set  in  the  solution  space;  the  intersection  of  these 
sets  represents  the  space  of  acceptable  solutions  [19].  By  projecting  an  initial  estimate  onto  these 
constraint  sets  iteratively,  a  solution  closer  to  the  original  signal  is  obtained.  In  this  problem,  the 
constraint  sets  arise  from  the  quantization  bounds  of  pixel  intensities.  Let  Tj  (I)  and  Tu  (I)  be 
the  lower  and  upper  quantization  bounds  for  a  measured  pixel  intensity  I,  and  let  x(ni,n2)  be  an 
estimate  of  the  high-quality  image  /(ni,n2).  Then  we  can  write  the  constraint  set  on  any  observed 
pixel  Zj(mi,m2)  as  follows: 

C[zj(mi,m2)]  =  {  /(ni,n2)  :  Tt  (zj(mi,m2))  <  |£j(mi,m2)|  <  Ttt(zj(mi,m2))  }  ,  (61) 

where  ^(^1^2)  is  the  calculated  pixel  intensity  derived  from  cc(ni,n2): 

Zi(mi,m2)  =r]i  I  ^2  hj(mi,m2;ni,n2)/(m,n2)  J  +/q.  (62) 

\ni  ,n2  / 
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When  there  are  multiple  observations,  the  intersection  of  the  constraint  sets  may  result  in  finer 
range  resolution  and  dynamic  range.  This  is  illustrated  in  Figure  41. 

In  [42],  a  projections  onto  convex  sets  (POCS)  algorithm  for  spatial-domain  enhancement  is  pre¬ 
sented.  In  that  algorithm,  bounds  on  the  constraint  sets  are  set  to  a  fixed  value,  which  is  chosen 
heuristically,  and  the  gray-scale  effects  are  ignored.  Here,  we  show  that  those  bounds  should  actu¬ 
ally  be  a  function  of  observed  pixel  intensity,  that  they  should  be  chosen  using  the  camera  response 
function.  Moreover,  the  illumination  changes  and  the  camera  settings  should  also  be  considered  in 
the  reconstruction. 


V _ _ _ / 


Extended  dynamic 
range 

Figure  41:  It  is  possible  to  increase  the  gray-scale  extent  and  resolution  when  there  are  multiple 
observations  of  different  range  spans. 


Projection  Operations 

For  each  pixel  in  the  observations,  we  can  define  a  constraint  set;  and  we  project  an  initial  estimate 
onto  these  constraint  sets  iteratively  to  update  the  initial  estimate.  The  result  is  a  solution  of 
higher  resolution  in  both  the  gray-scale  and  spatial  domains. 

The  projection  operator  should  update  x(n\,n2)  in  such  a  way  that  the  constraint  given  in  (61) 
is  satisfied.  We  design  the  projection  operator  so  that  it  projects  the  estimate  a:(ni,n2)  onto 
the  bounds  of  the  constraint  sets.  For  instance,  if  Zi{m\,m2)  is  larger  than  the  upper  bound 
Tu  (2^(7711,7712)),  then  the  projection  operator  updates  the  estimate  x(n\,n2)  such  that  the  new 
Zi{m\,m2)  (obtained  from  the  updated  x(ni,ri2))  is  equal  to  Tu  (z;(mi,  m2)).  After  some  simple 
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algebra,  the  projection  operator  onto  a  constraint  set  C  [2*  (mi,  m2)]  can  be  found  as 


Pc[zi(m  1,7712)]  [^(^l  5  ^2)]  ^(7^1,712)  +  \ 


(r„(»,cn))-i,w-„)s.(m;n),  it(m)>TMm))  | 

0,  Ti  (Zi(m))  <  |ij(m)|  <  Tu  (zj(m))  }  , 


(  S,(m;  n),  *(m)  <  T*  (zj(m))  j 

(63) 

where  m  =  {mi, m2),  n  =  (ni,n2),  x{rii,  112)  is  an  estimate  of  the  original  scene  /(nj,  772), 
Zi{mi,m2 )  is  the  calculated  intensity  from  the  estimate  x(ni, 712)  as  given  in  Equation  (62),  and 
hi{ m;n)  is  the  normalized  blurring  function: 


K{  m;  n) 


hi  (mi,  m2;  111,112) 

Y  |hi(m1,m2;n1,n2)|2 

711,712 


(64) 


4.7.3  Implementation 

In  this  section  we  summarize  our  implementation  of  the  algorithm.  Given  a  set  of  images,  we 
estimate  the  motion  and  illumination  parameters,  register  the  images  in  spatial  position  range  and 
range  to  obtain  an  initial  estimate,  and  then  project  the  initial  estimate  onto  the  constraint  sets 
(defined  for  each  pixel  in  the  observations)  iteratively. 

In  our  implementation  we  used  the  six-parameter  affine  motion  model.  To  determine  the  affine 
parameters,  we  use  the  Harris  corner  detector  [32]  to  select  a  set  of  points  in  the  reference  image. 
We  then  find  the  correspondence  points  in  the  second  image  using  normalized  cross  correlation  with 
quarter-pixel  accuracy.  A  least-mean-square  estimate  of  the  affine  parameters  can  be  determined 
from  the  motion  vectors  determined  at  these  points.  Once  the  affine  parameters  are  determined, 
the  images  are  spatially  registered  and  the  illumination  parameters  %  and  fii  are  estimated,  again 
using  minimum  least-mean-square  estimation. 

The  initial  estimate  is  obtained  by  bilinearly  interpolating  one  of  the  observations,  and  then  extend¬ 
ing  the  dynamic  range  by  registering  the  other  images  in  range.  To  perform  range  registrations,  the 
images  are  first  spatially  registered  (by  warping  on  to  the  reference  image  using  the  affine  param¬ 
eters),  and  then  registered  in  range  by  taking  a  weighted  sum  of  the  pixels.  The  weight  function 
is  chosen  to  be  a  Gaussian-like  function;  its  mean  is  set  to  the  middle  point  of  the  dynamic  range, 
which  is  a  result  of  the  relative  reliability  of  the  midrange  intensities  [48] . 


4.7.4  Experimental  Results 


We  provide  the  results  of  an  experiment  to  demonstrate  the  proposed  method.  We  captured  a  video 
sequence  using  a  Sony  DCR-TRV20  digital  camcorder.  While  capturing  the  sequence,  we  increased 
the  exposure  time  manually.  We  used  three  images  from  the  sequence  in  the  reconstruction.  These 
images  are  given  in  figures  42(a),  42(b),  and  42(c).  The  image  in  Figure  42(a)  was  captured  with 
a  relatively  short  exposure  time.  In  Figure  42(a)  the  low-contrast  details  in  the  lighter  regions  can 
be  seen  better,  while  in  Figure  42(c)  the  tonal  fidelity  is  higher  for  the  darker  regions. 
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We  applied  the  proposed  algorithm  to  these  images.  The  corners  detected  for  the  reference  images 
are  depicted  in  Figure  42(d).  The  correspondence  points  are  found  using  normalized  cross  correla¬ 
tion  with  quarter-pixel  accuracy,  a  block  size  of  8  pixels,  and  a  search  range  of  10  pixels.  The  PSF 
is  taken  as  a  7  x  7  Gaussian  window  with  a  standard  deviation  of  one.  Once  the  initial  estimates 
are  obtained,  they  are  projected  onto  the  constraint  sets  iteratively.  The  number  of  iterations  is  set 
to  seven.  The  reconstructed  image  is  given  in  Figure  43.  It  have  higher  dynamic  ranges  than  any  of 
the  observations  does  by  itself.  (The  pixel  intensity  range  is  given  as  a  side  bar.)  The  low-contrast 
regions  also  become  more  clear  in  the  reconstructed  images.  After  getting  the  high-dynamic-range 
image,  certain  portion  of  its  range  can  be  chosen  for  display  purposes.  In  Figure  44,  we  show 
zoomed  regions  from  these  images.  They  are  also  scaled  in  intensity  to  range  [0  —  255].  Close 
examination  of  these  figures  shows  that  the  spatial  resolution  has  also  been  improved  during  the 
reconstruction. 


4.7.5  Conclusions 

In  this  project,  we  presented  an  image  fusion  algorithm  that  improves  resolution  and  extent  in 
both  the  gray-scale  and  spatial  domains.  (In  the  experiments  we  did  not  demonstrate  increasing 
spatial  extent,  which  is  a  relatively  straightforward  application.)  We  included  the  gray-scale  effects 
of  the  image  acquisition  process  and  proposed  a  projections  onto  convex  sets  based  reconstruction 
algorithm.  The  constraint  sets  are  obtained  from  the  gray-scale  quantization  information.  Including 
the  range  effects,  the  algorithm  can  be  considered  as  a  generalization  of  spatial  super-resolution 
algorithms. 


M  (b) 


Figure  42:  Images  from  the  second  sequence,  (a)  First  image,  (b)  Second  image,  (c)  Third  image, 
(d)  Corners  detected  in  the  third  image. 
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Figure  43:  Reconstructed  image  from  the  second  sequence. 


(0  m> 


Figure  44:  Zoomed  regions  from  the  second,  (a)  First  image,  (b)  Second  image,  (c)  Third  image, 
(d)  Reconstructed  image  scaled  to  intensity  range  [0-255]. 

4.8  Super-Resolution  Reconstruction  of  Hyper-Spectral  Images 

During  the  late  1950s,  the  digital  computers  began  to  emerge  as  a  vital  tool  for  dealing  with  huge 
amounts  of  data.  Essentially  at  the  same  time,  significant  developments  in  space  technology  made 
the  artificial  satellites  possible.  The  rapidly  growing  data  processing  techniques  enabled  the  use 
of  space  imagery  for  obtaining  information  and  making  decisions.  After  decades  of  research,  space 
imagery  is  now  a  mature  field  with  military  and  civilian  applications,  such  as  target  detection, 
tracking,  mineral  exploration,  and  agriculture. 

One  of  the  important  parameters  in  a  space  imagery  system  is  the  spatial  resolution.  There  are 
various  effects  (atmospheric  scattering,  secondary  illumination  changing  viewing  angle,  sensor  noise 
just  to  name  a  few)  that  degrade  the  acquired  image  quality.  It  is  obvious  that  any  improvement 
in  spatial  resolution  will  pay  off  greatly.  To  improve  spatial  resolution  we  make  use  of  the  super¬ 
resolution  techniques  together  with  the  information  in  different  wavelengths  of  the  captured  spectra, 
which  is  made  available  with  the  development  of  the  hyper-spectral  sensors. 

An  integral  part  of  our  work  is  to  model  the  hyper-spectral  image  acquisition  process.  To  get 
the  best  results  we  require  our  imaging  model  to  be  complex  enough  to  incorporate  all  the  effects 
mentioned  above.  On  the  other  hand,  the  model  should  be  kept  as  simple  as  possible  to  avoid 
computational  complexity  problems.  Currently,  we  have  a  candidate  model  which  is  being  tested 
under  various  conditions.  The  simulation  results  obtained  using  this  model  will  also  be  presented. 
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4.8.1  Image  Acquisition  Model 


This  section  starts  with  some  background  information  about  the  physical  phenomena  we  are  trying 
to  model.  In  the  following  section  hyper-spectral  image  acquisition  process  will  be  briefly  discussed 
from  a  physical  point  of  view,  together  with  the  atmospheric,  environmental  and  device-dependent 
effects  that  influence  the  overall  process.  We  will  describe  our  hyper-spectral  image  acquisition 
model,  whereby  source  image  samples  are  interpreted  as  aliased  and  optically  blurred  linear  com¬ 
binations  of  the  target  image’s  spectral  basis.  Then  a  rigorous  mathematical  formulation  of  the 
proposed  model  will  be  provided. 

•  Background 

To  be  able  explain  a  hyper-spectral  image  we  require  some  background  information.  Any  physical 
object  in  a  scene  reflects,  absorbs  and  emits  electromagnetic  radiation  in  ways  characteristic  of  its 
molecular  composition  and  shape.  Electro-optical  remote  sensing  is  the  process  of  using  this  fact 
to  obtain  information  about  an  object  or  scene  without  coming  into  physical  contact  with  it.  If 
the  radiation  arriving  at  the  sensor  array  is  measured  at  a  sufficiently  high  number  of  wavelengths 
for  every  pixel,  the  resulting  spectrum  can  be  used  to  extract  information  that  cannot  be  found  in 
images  captured  by  conventional  devices  (that  give  no  or  very  little  information  about  the  spectral 
dimension).  Spectroscopy  is  the  field  concerned  with  the  measurement,  analysis  and  interpretation 
of  such  spectra.  Another  related  field,  which  is  actually  a  branch  of  spectroscopy,  is  imaging 
spectroscopy.  It  can  be  explained  as  combining  spectroscopy  with  methods  to  acquire  spectral 
information.  Hyper-spectral  sensors  are  a  class  of  imaging  spectroscopy  sensors,  for  which  the 
waveband  of  interest  is  divided  into  hundreds  of  essentially  continuous  narrow  bands.  As  the  name 
suggests,  the  hyper-spectral  sensors  differ  from  their  predecessors,  the  multi-spectra!  sensors;  in 
that  the  number  of  bands  which  can  be  sensed  separately  is  much  higher.  (For  example  AVIRIS 
Airborne  Visible/Infrared  Imaging  Spectrometer  from  NASA/JPL  has  224  bands.)  Hyper-spectral 
images  are  the  name  given  to  multi-channel  images  captured  by  hyper-spectral  sensor  arrays.  They 
are  generally  the  data  type  obtained  for  space  imagery  application  like  mining,  civil  engineering, 
and  military  applications  such  as  mine  detection  and  information  gathering. 

For  a  given  ground  pixel,  whose  dimensions  can  be  in  the  range  of  tens  of  centimeters  to  tens  of 
meters  depending  on  the  spatial  resolution  of  the  imaging  device,  the  radiance  observed  at  any 
particular  wavelength  is  determined  by  the  reflectance  of  the  matter  and  the  solar  illumination  at 
that  wavelength.  However,  there  exist  many  important  additional  effects,  which  include  but  are  not 
limited  to:  imperfect  optics  of  the  imaging  device,  the  upwelling  solar  radiance  from  atmospheric 
scattering;  the  secondary  illumination  of  the  object  by  light  reflected  from  adjacent  objects;  the 
scattering  and  absorption  of  the  reflected  radiance  by  the  atmosphere;  spatial  and  spectral  aberra¬ 
tions  in  the  sensors,  finite  sensor  dimensions  and  viewing  angle  of  the  sensor  array.  Characterizing 
and  compensating  for  these  effects  is  a  key  step  in  the  exploitation  of  hyper-spectral  imagery.  Al¬ 
though  the  model  we  propose  is  very  general  in  the  sense  that  it  makes  no  specific  assumptions 
about  the  imaging  device  used  and  incorporates  all  effects  that  have  influence  on  the  spatial  and 
spectral  resolution  of  the  observed  scene,  it  excludes  some  physical  effects,  which  are  directly  re¬ 
lated  to  the  sensor  characteristics  and  secondary  illumination  effects.  In  the  starting  phase  of  this 
project,  we  assume  that  the  atmospheric  compensation  and  sensor  calibration  methods  have  already 
been  applied  and  we  focus  on  the  image  processing  side  the  process. 

•  Mathematical  Model 
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Gain  Offset  Saturation  Digitization 
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Figure  45:  The  hyper-spectral  image  acquisition  model. 

The  block  diagram  shown  in  Fig.  45  depicts  the  system  model.  The  ideal  continuous-time, 
continuous-space  and  continuous-spectrum  video  signal,  denoted  by  f(xi,X2,t,X),  represents  the 
actual  input  to  the  imaging  device.  Ideally,  we  would  like  to  reconstruct  f(x\,X2,t,X)  from  the 
available  observations  captured  by  some  hyper-spectral  imaging  device.  However,  it  is  not  feasi¬ 
ble  to  reconstruct  the  continuous  signal  f(xi,X2,t,X).  We  will  overcome  this  by  dimensionality 
reduction  and  discretization  of  the  spectral  and  spatial  dimensions. 

It  is  a  well-known  fact  that  the  spectral  reflectance  of  natural  images  can  be  accurately  modeled 
in  terms  of  linear  combinations  of  a  relatively  small  number  (generally  no  more  than  seven)  of 
reflectance  basis  functions.  These  illuminant-independent  orthonormal  basis  functions  can  be  ob¬ 
tained  by  computing  the  first  P  principal  components  for  a  large  set  of  natural  image  reflectances. 
As  a  first  step  in  our  model,  we  assume  that  f(xi,X2,t,X)  is  represented  as  linear  combination  of 
these  basis  functions.  That  is,  at  every  pixel  location,  f(xi,X2,t,X)  is  given  by  a  P  dimensional 
vector  where  the  elements  of  this  vector  are  the  coefficients  of  corresponding  orthonormal  basis 
functions.  Once  this  idea  is  established,  we  can  proceed  with  the  spatial  domain.  To  deal  with  the 
spatial  domain,  we  hypothesize  that  for  each  of  the  P  image  planes,  there  exists  a  corresponding 
discrete,  high-resolution  video  scene  fj{ni,ri2,t )  (j  =  1,2, ...,P)  and  we  seek  to  reconstruct  an 
image  from  that  signal  sampled  at  a  specific  time  instant,  namely  fj(ni,ri2,tr).  The  main  assump¬ 
tion  here  is  that  the  continuous  signal  fj(x\,X2,t )  can  be  reconstructed  from  the  spatially  discrete 
high-resolution  image  fj(ni,ri2,t)  through  an  ideal  reconstruction  filter  hr. 

In  the  sections  to  follow  we  will  model  the  combination  of  the  image  acquisition,  spatial  and  spectral 
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filtering  and  sampling.  Although  the  ideas  we  are  trying  to  convey  are  rigid,  and  the  path  we  take 
to  do  so  is  mathematically  rigorous,  one  can  easily  get  confused  with  the  notation  used.  To  leave 
no  room  for  possible  misunderstandings,  we  end  this  section  with  a  description  of  the  mathematical 
notation  that  is  used  through  out  the  report.  The  hyper-spectral  image  data  is  best  represented  in 
the  form  of  a  P-dimensional  vector  for  each  pixel  where  P  is  the  number  of  spectral  bands.  Following 
this  convention  we  write  fjn]  =  [/i[n]  /2[n]  ...  /p[n]]T  to  denote  the  P-dimensional  pixel  at  the 
location  n  —  [ni  n2]T.  We  use  fj(x  1,0:2)  for  the  spatially  continuous  target  image  planes  and 
fj[ni,ri2)  for  the  spatially  discrete  target  image  planes.  Similarly,  gi(x  1,0:2)  denotes  continuous 
observation  image  planes  and  gi(ni,ri2 )  denotes  discrete  observation  image  planes.  Hence,  any 
pixel  denoted  by  / — no  matter  what  the  subscript  or  the  indices  may  be — is  a  target  image  pixel, 
and  the  same  applies  to  g  with  observation  image  pixels.  Furthermore,  at  some  point  it  will  be 
necessary  to  differentiate  between  high-  and  low-resolution  grid  pixels.  For  this  purpose,  the  high- 
resolution  grid  pixels  are  indexed  with  n  =  [n\  n2]T  and  low-resolution  grid  pixels  are  indexed  with 

TO  =  [mi  TO2]T. 

•  Conversion  from  the  Discrete  2D  Sequence  into  a  2D  Impulse  Train 

The  first  step  in  the  ideal  reconstruction  process  is  the  conversion  of  the  discrete  signals  into  2D 
impulse  trains.  Since  the  following  operations  are  performed  on  each  of  the  P  target  image  planes, 
we  will  drop  the  subscript  j  and  denote  the  resulting  signal  as  fs(xi,x2,t): 

N  N 

fs{xi,x2,t)  =  V  E  /[m,n2)t]5(a:i  -  ~  t^)-  (65) 

711=0)12=0  Al  *2 


Note  that  the  spatial  sampling  frequency  is  normalized  for  the  low-resolution  grid  so  that  Ai  and 
A2  show  the  increase  in  the  spatial  sampling  density  when  we  move  from  the  low-resolution  image 
(observation  or  source)  to  the  high-resolution  image  (target).  In  other  words,  if  we  assume  that  the 
sampling  distance  in  the  low-resolution  image  is  1  unit,  then  in  the  same  unit,  the  high-resolution 
image  has  Ai  and  A2  samples  in  the  horizontal  and  vertical  directions  respectively. 

•  Ideal  Reconstruction  Filter 


We  implicitly  assume  that  the  high-resolution  target  image  (hence  its  reconstructed  version)  exists 
at  all  times  t.  Therefore,  in  the  following  equations,  time  index  t  is  suppressed.  Keeping  this  in 
mind,  convolution  with  the  reconstruction  filter  can  be  written  as 


f{x  i,x2)  =  //,.(*  -ui,x2-  u2)hr{ui,u2)du\du2. 


(66) 


Substituting  fs(x i,x2)  from  (65)  we  get 

N  N 


f(xi,x2)=  E  E  /[ra  i>n2]<?(ai  -ui-~-,X2-u2-  Y-)hr(ui,u2)  duidu2. 

711=0  722=0  1  2 

Assuming  convergence  we  can  exchange  the  order  of  summations  and  integrals  to  write 
N  N  .. 

f(xi,x2)  =  V  E  /[m.na]  // d(xi-ui--2,X2-U2-^-)hr(ui,u2)duidu2. 

nf=0  n2=0  JJ  Al  A2 


(67) 


(68) 
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Using  the  sifting  property  of  impulse  functional,  we  get 


N  N 


f(x  l,X2)  =  EE  f[ni,n2]hr(x i  -  -^,x2  -  -^), 


ni=0n2=0 

or  if  we  include  the  suppressed  time  variable  t: 

N  N 


A2' 


f(xi,X2,t)  =  V  Y]  f[ni,n2,t]hr(xi  -  -^,ff2  -  t^)- 
n7=0  ri2=0  Al  A2 


(69) 


(70) 


•  Spectral  Representation  with  Predetermined  Basis  Functions 


We  assume  that  the  basis  functions  are  predetermined  by  applying  the  principal  component  analysis 
(PCA)  on  training  data  and  selecting  the  first  P  principle  components.  If  we  denote  the  continuous 
signal  as  fc(x i,  £2,  t ,  A),  then  we  have 


p 

fc(x\,  ff2,  t,  A)  -  y'  bj(\)fj(xi,X2,t). 

j- 1 

Noting  that  (70)  applies  to  each  of  the  P  target  image  planes,  we  can  write 


P 

fc(xi,x2,t,  A)  =  ^2bj{ A) 
J=1 


■  JV  JV 

X!  5Z  /i  tnl  1  n2  >  *]  hr  (ffl.t  — 
„  711=0  772  — 0 


ni 

Ai 


(71) 


(72) 


•  Spatial  Filtering  (Optical  Lens  &  Sensor  Integration  Blur  Model) 

We  use  /ia(xi, a;2)  to  denote  the  spatially  invariant  blur  filter  at  every  wavelength  A.  This  models 
the  imperfect  imaging  optics  (e.g.  lens  blur)  and  the  unavoidable  sensor  integration  blur  due  to  the 
finite  sensor  area.  The  blur  operation  can  be  written  as  the  convolution  of  the  target  image  planes 
with  the  blur  filter: 


fc,b{x  1,  3!2,  t, 


h\{x  1  -  Ui,x2  -  V2)fc{y\,V2,t,  A )duidi>2, 


(73) 


where  subscript  c,  b  means  continuous  and  blurred.  We  use  the  motion  mapping  M  for  relating  the 
frames  occurred  at  different  times.  M  =  ( Mi,M2 )  is  defined  as 


ffi.tr  =  Mi(x1,x2,t>tr), 

ff2.tr  =  M2(xi,X2,t,tr)  ( tr  >  t ).  (74) 

For  the  case  at  hand,  we  can  write  /c(ffi,tr,ff2.tr,ir,  A)  =  fc(is i,i/2,t,  A).  Then  by  using  the  inverse 
of  the  mapping  mentioned  above,  we  get 


fc,b{.X\i  ff2>  t)  A)  J  J  h\(x\  M j  (ff  l,tr  ,  ff2,tr ,  t)  tr ) ,  ff2  A^2  (ffl,tr,  ff2,tr)^,  *r))|J| 

X  fc(xi,tr,X2,tr,UA)dxi,trdx2,tr,  (75) 
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where  |J|  is  the  Jacobian  of  the  motion  mapping.  If  we  make  the  definition 
hv,\(xi,X2-,xi!tr,X2,tr-,t-,tr )  =  |J|hA(«i  -  M{1(xittr,X2,tr,t,U),X2  ~  M21(xx,tT,X2,tT,t,tr)),  (76) 
we  can  write  fc,b(xi,X2,t,  A)  as  follows: 

fc,b(xi,x2,t,X)  =  jj  hVtx{xi,X2;xi>tT,X2,tr\t-,tr)fc(xi>tr,X2,tr,tr,\)dxiitrdx2,tr-  (77) 
Substituting  the  previously  derived  /c(^i5ir,^2,tr,ir?  A)  into  this,  we  get 

“  J ' j  j  %2]  %1  ,tri  %2,tr !  ^5  ^r) 

71  i  r)n 

dxittrdx2,t 


’  P  N  N 

XMA)  X  X  /?'  tnl  >  n2 )  ^r]  K  (*l,tr  -  ?W  “  T^) 

.  j=l  ni=0n2=0  Xl  X<2  . 


Again  assuming  convergence,  we  can  exchange  the  integrals  and  summations  to  obtain 
fCib{xi,X2,t,  A) 

p  N  N 

=  5XA)  X  X  /j[nl>n2,*r] 

j=l  ni=0  ri2=0 

^ ^  ^t,a(*£1  j  *^2j  »  ^2}tr )  t^hr  (#l,tr  )^*^'lJtr^^2,ir* 

To  get  a  simpler  form,  we  make  the  following  definition: 


(78) 


(79) 


j  ^2j  ^1  j  77-25  ^5 


— )dzi,trdz2,ir,  (80) 
A2 


which  allows  us  to  write 


p  N  N 

fc,b(xu x2, t,  A)  =  X bj(A)  X  X  /j[ni;n2,M^6(a:i)g2;ni,n2;t;tr)-  (81) 

j=i  m=o  712=0 


•  Spectral  Filtering 

The  spectral  response  functions,  ri(  A),  ri(A)  ...  r<g(A)  model  the  hyper-spectral  sensors’  efficiency 
at  different  wavelengths  as  well  as  the  atmospheric  and  illuminator-based  effects  on  the  spectrum. 

poo 

9i(xi,X2,t)  —  /  fc,b(xi,X2,t,  A)r*(A)  dA.  (82) 

Jo 

•  Spatial  Domain  Sampling 
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Next,  we  must  discretize  the  observations  for  practical  implementation.  This  is  done  by  sampling 
the  giS  on  a  low-resolution  grid. 


bj(\)n(X)  dX 


N  N 

X  X  fAni>n2,tr]hb(rni,m2;ni,n2-,tk;tr), 
m=o  ri2=0 


(83) 


where  the  second  equality  follows  from  the  assumption  that  the  integrals  and  summations  converge, 
and  hence  we  can  exchange  their  order.  If  we  call  the  integral  in  brackets  as  Wij,  then  we  can  write 

P  N  N 

gi{m1,m2,k)  =  J2wi,j  EE  fj  [rai ,  712 ,  tr  ]  hb  (mi ,  m,2 ;  n\ ,  n2 ;  tk ;  tr ) ,  (84) 

j= 1  711=0  77-2=0 


or  in  matrix  form,  with  Q  =  31  and  P  —  6, 


’  Pi 

wlyl  *■ 

*  ^1,6 

’  h[n]  ■  him] m  tk\tT)  ' 

92 

= 

-0)2,1  * 

•  ^2,6 

f2\n }  ■  hb(m]  n;  tk;tr) 

.  P31 

.  W31,l  • 

■  •  ^31,6  _ 

_  k\n )  ■  hb(m; n; tk\tr)  _ 

where  we  made  the  following  substitution  to  get  a  simpler  looking  equality: 

N  N 

fj  [23J  *  hh(m;  ti\  tr)  =  EE  fj[ni,n2,  tr]hb(mi,m2;ni,  n2\tk\  tr).  (85) 

m—0  ?i2=o 


•  Additive  Noise 

v(mi,m2,k)  models  the  total  effect  of  all  possible  noise  sources  that  exist  through  out  the  whole 
acquisition  process. 


4.8.2  Solving  the  Inverse  Problem 

Given  the  model  presented  in  the  previous  section,  the  inverse  problem  can  be  stated  as  finding 
the  target  image  that  is  in  as  much  agreement  as  possible  with  the  observations.  Here  being  in 
agreement  deserves  some  explanation.  When  we  say  the  candidate  target  image  is  in  agreement 
with  the  observations  we  mean  that  if  we  apply  the  linear,  time  and  space-varying  (LTSV)  filter  hb 
in  (84)  to  the  candidate  target  image,  the  resulting  synthetic  observation  image  is  close  to  the  real 
observations  captured  by  the  imaging  device  under  consideration.  There  exist  many  ways  to  solve 
this  problem,  each  with  its  advantages  and  disadvantages.  For  example,  we  can  try  to  minimize 
the  squared  error  between  the  observed  images  and  the  synthetically  produced  observation  images 
by  using  the  well-studied  least  squares  methods.  The  drawback  of  this  approach  is  that  it  requires 
the  computation  of  inverses  of  large  matrices,  which  is  in  most  cases  difficult  to  compute.  A  widely 
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preferred  alternative  is  the  iterative  set-theoretic  methods.  In  this  paper,  we  will  use  POCS  method 
to  solve  the  inverse  problem  addressed  above. 

Regardless  of  the  method  used  to  solve  for  the  target  image,  a  good  share  of  the  total  effort  is  put 
into  calculating  the  LTSV  blur  filter  hb.  Prom  (80),  we  can  see  that  hi,  has  a  complex  structure. 
It  depends  on  the  reconstruction  and  spatial  blur  filter  as  well  as  the  motion  in  the  scene.  Due  to 
these  dependencies,  the  computation  of  hb  can  easily  get  involved.  Furthermore,  hi,  is  only  valid 
where  the  motion  is  accurately  modelled,  making  the  precision  of  the  motion  vectors  used  in  the 
computations  extremely  important. 

In  many  cases,  computational  load  or  real-time  implementation  specifications  renders  a  direct 
calculation  impossible  and  we  are  required  to  approximate  hb .  A  good  understanding  of  the  LTSV 
filtering  operation  given  in  (84)  is  helpful  in  precisely  approximating  hb .  For  this  reason,  we  will 
study  two  special  cases,  namely  single  frame  and  multiple  frames  with  translational  motion.  It  will 
turn  out  that,  in  these  cases  hb  is  fairly  easy  to  compute  and  has  a  nice  interpretation  which  sheds 
light  on  how  we  can  approximate  it. 

•  Single  Frame  Case 

In  the  single  frame  case,  we  have  only  one  hyper-spectral  observation,  which  is  a  set  of  monochro¬ 
matic  images  of  the  same  scene  captured  at  different  wavelengths.  Our  aim  is  to  reconstruct  the 
high-resolution  target  image  planes  fj(x i,  x2,  A)  for  j  —  1, 2, . . . ,  P,  from  which  we  can  obtain  the 
spatially  and  spectrally  continuous  target  image  by  using  the  spectral  basis  functions.  Note  that 
the  time  dependency  is  dropped  since  we  are  working  with  only  one  observation.  Following  the 
same  steps  as  in  the  previous  sections  one  can  show  that  for  single  observation  case,  fc,b(xi,x2,  A) 
is  given  by 


N  N 


fc,b{x  1,X2,X)  =  Y/bj(. A)  Y  Y  /j[n l>n2]  /  /  M*1  -  eUx2  -  €2)Mei  ~Y’€2~  dClde 2' 

j= 1  711=0  712=0  ^  '  12 

(86) 

If  we  make  the  definition 

l(xi,X2)  =  h\{x i,x2)  *  hr(x i,x2),  (87) 

where  *  denotes  convolution,  we  can  see  that 


P  N  N 

fc,b(x  1,X2,A)  =  ^2bj( A)  Yj  /j[n Un2h{xi  -  YL,X2  -  y^). 

j=l  nr=0  n2=0  Ai  A2 


Continuing  with  spectral  filtering  and  sampling  on  a  low-resolution  grid  gives 

p  N  N 

9i(mi,m2 )  =  Y whj  Y  Y  /j(n un2h(mi  -  ^,m2  -  ^). 


j= 1  711=0  712=0 


(88) 


(89) 


•  Multi-Frame  Translational  Motion  Case 

In  the  translational  motion  case,  we  have  many  observations  but  the  motion  in  the  observed  scene 
is  constrained  to  be  global  translational  motion.  This  type  of  motion  can  be  incorporated  into  the 
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model  by  letting 

Mi(xi,X2,t,tr )  =  ®i  +  <5i(tr  -  t), 

M2(xi,X2,t,tr)  =  X2  +  62{tr —t).  (90) 

Using  these  motion  mappings  and  proceeding  as  in  the  single  frame  case,  we  obtain 
p  N  N 

fc,b(%h  ®2t  t,  A)  )  ]  bj  (A)  y  ^  y  ^  fj\bl\,n2itr\'^f{x\  ■  h  <^l(tj-  t/x2  .  H  <52(tr  t))- 

j=l  ni=0n2=0  *  2 

(91) 

Continuing  with  spectral  filtering  and  sampling  on  a  low-resolution  grid  gives 
p  N  N 

9i(m1,m2)  =  V  Wij  /L  /j[ni,n2]7(mi  -  ^  +  5i(tr  -  t),m2  -  ^  +  S2(tr  - t )).  (92) 

j=l  m=0n2=0 

From  Eq.  (92)  we  can  see  that  the  same  interpretation  given  in  the  previous  section  applies  to  the 
translational  motion  case  with  a  slight  change.  In  this  case,  the  effective  blur  window  is  moving  at 
the  same  speed  and  in  same  direction  as  the  global  translational  motion. 

4.8.3  Experimental  Results 

We  designed  and  conducted  several  experiments  to  test  the  proposed  method.  In  these  experiments, 
we  used  two  hyper-spectral  data  sets.  The  first  set  is  from  the  University  of  Pennsylvania  database 
located  at  http://color.psych.upenn.edu/hyperspectral..  These  images  are  captured  under  controlled 
illumination  lab  environment  and  have  31  spectral  bands  (Q=31).  The  second  data  set  used  in  the 
experiments  is  a  224  band  (Q=224)  image  of  an  urban  scene  captured  by  AVIRIS.  (The  exact  name 
of  the  picture  is  Moffett  Field.  For  detailed  information  on  the  data  set  see  [165].  Since  the  AVIRIS 
data  includes  frequencies  far  beyond  the  visible  range  it  is  meaningless  to  try  to  render  RGB  images 
for  visual  evaluation.  Therefore,  to  present  visual  results  we  select  some  specific  spectral  band  (for 
all  the  visual  results  presented  in  this  report  the  hundredth  band  is  used)  and  demonstrate  the 
images  corresponding  to  this  band.  To  simulate  the  hyper-spectral  imaging  process,  these  high- 
quality  images  are  blurred  and  downsampled  in  both  spatial  and  spectral  domains.  The  resulting 
observations  have  15  and  112  spectral  bands,  respectively,  and  are  half  the  size  of  the  originals 
spatially.  The  proposed  reconstruction  technique  was  tested  under  translational  and  affine  motion 
scenarios.  Before  giving  the  detailed  simulation  results  we  briefly  discuss  these  motion  models  and 
their  relevance  to  the  aerial  and  space  imagery.  As  a  notational  convenience,  we  will  denote  the 
two  dimensional  displacement  as  dTit(x )  =  x(t)  -  x(t)  where  x(t)  -  (x(t),  y(t))T  is  the  2D  position 
at  time  t. 

•  Translational  Motion:  The  translational  motion  model  is  limited  to  simple  shifts  in  the  image 
and  the  displacement  vector  can  be  written  as  follows: 

d(x)  =  [  £  ]  (93) 
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•  Affine  Motion:  The  six  parameter  affine  motion  model  can  handle  shifts,  rotations  and  zooming. 
The  displacement  vector  depends  on  the  specific  pixel  location  and  can  be  written  as: 


d(x)  = 


ki  A;2 

kb  65 


x  + 


&3 


(94) 


Considering  the  conditions  under  which  the  hyperspectral  image  data  is  acquired,  one  can  be 
convinced  that  these  are  relevant  and  realistic  motion  models.  For  each  motion  scenario  we  have 
three  different  blur/downsample  configurations: 

Case  1 


•  5  x  5  Gaussian  spatial  blur  filter  with  a  variance  of  two. 

•  Gaussian  spectral  blur  filter  with  unit  variance. 

•  Downsampling  ratio  is  four  in  both  vertical  and  horizontal  directions. 

•  For  the  multi-frame  case  eight  observation  images  are  used. 

Case  2 

•  5  x  5  Gaussian  spatial  blur  filter  with  a  variance  of  two. 

•  Gaussian  spectral  blur  filter  with  unit  variance. 

•  Downsampling  ratio  is  two  in  both  vertical  and  horizontal  directions. 

•  For  the  multi-frame  case  four  observation  images  are  used. 

Case  3 

•  3  x  3  Gaussian  spatial  blur  filter  with  unit  variance 

•  Gaussian  spectral  blur  filter  with  unit  variance. 

•  Downsampling  ratio  is  two  in  both  vertical  and  horizontal  directions. 

•  For  the  multi-frame  case  two  observation  images  are  used. 


The  motion  vectors  are  calculated  by  applying  the  block  matching  algorithm  [162]  on  the  properly 
upsampled  images.  Visual  results  are  presented  in  Figures  52,  53  and  54  and  numerical  results  are 
presented  in  the  following  section.  All  the  images  are  RGB  rendered  from  the  hyperspectral  outputs 
for  qualitative  comparison.  Close  examination  of  these  figures  shows  that  the  spatial  resolution  has 
been  improved  during  the  reconstruction. 

We  provide  the  following  simulation  results  to  demonstrate  the  proposed  method  under  the  three 
scenarios  mentioned  above  together  with  the  results  of  bilinearly  interpolating  the  separate  spectral 
bands.  All  the  results  given  in  the  tables  below  are  PSNR  values  in  dB. 
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Prom  the  tables  we  can  see  that  the  proposed  method  with  multiple  observations  performs  much 
better  than  bilinear  interpolation. 


Numerical  results  in  terms  of  two  different  fidelity  measures  are  presented  below.  These  measures 
are  PNSR  and  band-averaged  PNSR.  In  this  paper  PSNR  is  defined  as 

PSNR  =  10  logi0  dB  (95) 

where  Speak  stands  for  the  peak  signal  power,  and  band-averaged,  PNSR  is  defined  as 

APSNR  =  10 log10  dB  (96) 

where  Spea^i  stands  for  the  peak  signal  power  in  the  ith  spectral  band.  Since  the  data  we  work  on  is 
not  quantized,  the  maximum  signal  value  is  not  fixed.  The  band-averaged  PNSR  (APSNR),  for 
which  the  numerator  is  calculated  as  the  average  of  the  peak  signal  powers  of  all  bands,  is  selected 
to  compensate  for  this  fact. 


AVIRIS  Reflectance  Data  -  1 

Bilinear 

interpolation 

Single-cube  POCS 
(no  motion) 

Multi-cube  POCS 
(translational) 

Multi-cube  POCS 
(affine) 

—  Case  1 

— 

PSNR 

23.9348 

24.3563 

25.2773 

24.9631 

APSNR 

21.1726 

21.5942 

22.5152 

22.1395 

—  Case  5 

_ 

PSNR 

22.9235 

23.0498 

24.5638 

24.2294 

APSNR 

20.1614 

20.2877 

21.8017 

21.6035 

—  Case  c 

t  — 

PSNR 

22.8220 

23.1218 

24.1632 

23.9102 

APSNR 

20.0599 

20.3597 

21.4011 

21.0567 

AVIRIS  Reflectance  Data  -  2 

Bilinear 

interpolation 

Single-cube  POCS 
(no  motion) 

Multi-cube  POCS 
(translational) 

Multi-cube  POCS 
(affine) 

—  Case  ] 

— 

PSNR 

24.2661 

24.5539 

25.3473 

25.1096 

APSNR 

18.4326 

18.7204 

19.5138 

19.2319 

—  Case  2  — 

PSNR 

23.5391 

23.6077 

24.7463 

24.5771 

APSNR 

17.6878 

17.7564 

18.8950 

18.6409 

—  Case  c 

1  — 

PSNR 

23.4939 

23.6598 

24.4973 

24.2079 

APSNR 

17.6426 

17.8085 

18.5560 

18.3508 
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AVIRIS  Radiance  Data  -  1 

Bilinear 

interpolation 

Single-cube  POCS 
(no  motion) 

Multi-cube  POCS 
(translational) 

Multi-cube  POCS 
(affine) 

—  Case  1 

— 

PSNR 

36.1382 

38.3745 

42.8398 

42.5970 

APSNR 

28.9405 

31.1768 

35.6421 

35.3455 

—  Case  5 

_ 

PSNR 

31.9181 

32.2407 

39.3321 

39.0261 

APSNR 

24.7204 

25.0430 

32.1344 

31.8971 

—  Case  c 

1  — 

PSNR 

31.7053 

32.5597 

37.1517 

36.9012 

APSNR 

24.5076 

25.3620 

29.9540 

29.6941 

AVIRIS  Radiance  Data  -  2 

Bilinear 

interpolation 

Single-cube  POCS 
(no  motion) 

Multi-cube  POCS 
(translational) 

Multi-cube  POCS 
(affine) 

—  Case  ] 

— 

PSNR 

35.9959 

37.6029 

40.5035 

40.3064 

APSNR 

29.1380 

30.7451 

33.6456 

33.3822 

—  Case  2  — 

PSNR 

32.4086 

32.7777 

37.8406 

37.5928 

APSNR 

25.5500 

25.9190 

30.9820 

30.7559 

—  Case  c 

5  — 

PSNR 

32.1307 

33.1518 

36.0935 

35.8842 

APSNR 

25.2721 

26.2932 

29.2349 

28.9151 

From  the  tables  above  we  can  see  that  the  proposed  method  even  with  a  single  source  cube  performs 
better  than  bilinear  interpolation.  Using  multiple  cubes  further  improves  the  results,  thus  pointing 
out  the  advantage  of  fusing  the  information  present  across  overlapping  sources.  Visual  results 
presented  in  Figures  (52),  (53),  (54)and  (55)  also  confirm  the  improvement  seen  in  PSNR  and 
APSNR  values.  A  careful  inspection  of  the  tables  also  reveals  that  the  PSNR  and  APSNR 
values  for  the  Moffett  Field  reflectance  images  are  10  to  15  deciBels  lower  than  the  values  for  the 
radiance  images.  This  offset  is  caused  by  the  high  frequency  components  present  in  the  reflectance 
images.  Figure  (46)  shows  the  2D  FFT  of  some  specific  spectral  bands  of  the  test  images.  For  the 
Moffett  Field  images  the  hundredth  spectral  band  is  used  and  for  the  BearFruitGray  images  it  is 
the  tenth  spectral  band.  Note  that  the  values  of  the  color-map  bars  are  in  (natural)  logarithmic 
scale.  From  these  plots  we  can  clearly  see  that  the  reflectance  images  have  larger  high  frequency 
components  compared  to  the  radiance  images.  When  filtered  with  a  low-pass  filter  such  as  the  blur 
filters  we  use  to  obtain  our  observation  images,  these  components  are  heavily  degraded  if  not  totally 
lost.  This  lost  information  is  the  main  reason  for  the  offset  between  the  radiance  and  reflectance 
images. 
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(a)  Radiance  1  (b)  Reflectance  1 


Figure  46:  Frequency  contents  of  some  specific  bands  of  the  test  images.  For  the  Moffett  Field 
images  the  hundredth  spectral  band  is  used. 
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4.8.4  Figures 


(c)  (d) 


Figure  47:  Experimental  results,  (a)  Original  image,  (b)  One  of  the  observations,  (c)  Single-frame 
reconstruction  result,  (d)  Multi- frame  reconstruction  result.  (Four  observations  used.) 
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(c)  (d) 


Figure  48:  Experimental  results,  (a)  Original  image,  (b)  One  of  the  observations,  (c)  Single-frame 
reconstruction  result,  (d)  Multi-frame  reconstruction  result.  (Four  observations  used.) 
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(a)  Original 


(b)  Bilinear  (c)  Multi  frame 


(d)  Bilinear  (e)  Multi  frame 


(f)  Bilinear  (g)  Multi  frame 


Figure  49:  Results  for  test  image  chosen  from  block  31. 
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(a)  Original 


(f)  Bilinear  (g)  Multi  frame 


Figure  50:  Results  for  test  image  chosen  from  block  34. 
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(a)  Original 


.1  f  T  T  ~ 

NMMHMMH 

(b)  Bilinear 

(c)  Multi  frame 
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(d)  Bilinear 

(e)  Multi  frame 

(f)  Bilinear  (g)  Multi  frame 

Figure  51:  Results  for  test  image  chosen  from  block  44. 
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(a)  Original 


(b)  Bilinear  (c)  Single  cube  (d)  Multi  cube 


(e)  Bilinear  (f)  Single  cube  (g)  Multi  cube 


(h)  Bilinear  (i)  Single  cube  (j)  Multi  cube 


Figure  52:  Results  for  the  first  reflectance  test  image  extracted  from  224-band  Moffett  Field. 
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(b)  Bilinear  (c)  Single  cube  (d)  Multi  cube 


(e)  Bilinear  (f)  Single  cube  (g)  Multi  cube 


(h)  Bilinear  (i)  Single  cube  (j)  Multi  cube 


Figure  53:  Results  for  the  second  reflectance  test  image  extracted  from  224-band  Moffett  Field. 
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(a)  Original 


(b)  Bilinear  (c)  Single  cube  (d)  Multi  cube 


(e)  Bilinear  (f)  Single  cube  (g)  Multi  cube 


(h)  Bilinear  (i)  Single  cube  (j)  Multi  cube 


Figure  54:  Results  for  the  first  radiance  test  image  extracted  from  224-band  Moffett  Field. 
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(a)  Original 


(b)  Bilinear  (c)  Single  cube  (d)  Multi  cube 


(e)  Bilinear  (f)  Single  cube  (g)  Multi  cube 
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(h)  Bilinear  (i)  Single  cube  (j)  Multi  cube 


Figure  55:  Results  for  the  second  radiance  test  image  extracted  from  224-band  Moffett  Field. 
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